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Preface 


Subject  Area  of  the  Book 

In  this  era,  where  a  huge  amount  of  information  from  different  fields  is  gathered  and 
stored,  its  analysis  and  the  extraction  of  value  have  become  one  of  the  most 
attractive  tasks  for  companies  and  society  in  general.  The  design  of  solutions  for  the 
new  questions  emerged  from  data  has  required  multidisciplinary  teams.  Computer 
scientists,  statisticians,  mathematicians,  biologists,  journalists  and  sociologists,  as 
well  as  many  others  are  now  working  together  in  order  to  provide  knowledge  from 
data.  This  new  interdisciplinary  field  is  called  data  science. 

The  pipeline  of  any  data  science  goes  through  asking  the  right  questions, 
gathering  data,  cleaning  data,  generating  hypothesis,  making  inferences,  visualizing 
data,  assessing  solutions,  etc. 


Organization  and  Feature  of  the  Book 

This  book  is  an  introduction  to  concepts,  techniques,  and  applications  in  data 
science.  This  book  focuses  on  the  analysis  of  data,  covering  concepts  from  statistics 
to  machine  learning,  techniques  for  graph  analysis  and  parallel  programming,  and 
applications  such  as  recommender  systems  or  sentiment  analysis. 

All  chapters  introduce  new  concepts  that  are  illustrated  by  practical  cases  using 
real  data.  Public  databases  such  as  Eurostat,  different  social  networks,  and 
MovieLens  are  used.  Specific  questions  about  the  data  are  posed  in  each  chapter. 
The  solutions  to  these  questions  are  implemented  using  Python  programming 
language  and  presented  in  code  boxes  properly  commented.  This  allows  the  reader 
to  learn  data  science  by  solving  problems  which  can  generalize  to  other  problems. 

This  book  is  not  intended  to  cover  the  whole  set  of  data  science  methods  neither 
to  provide  a  complete  collection  of  references.  Currently,  data  science  is  an 
increasing  and  emerging  field,  so  readers  are  encouraged  to  look  for  specific 
methods  and  references  using  keywords  in  the  net. 


v 
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Target  Audiences 

This  book  is  addressed  to  upper-tier  undergraduate  and  beginning  graduate  students 
from  technical  disciplines.  Moreover,  this  book  is  also  addressed  to  professional 
audiences  following  continuous  education  short  courses  and  to  researchers  from 
diverse  areas  following  self-study  courses. 

Basic  skills  in  computer  science,  mathematics,  and  statistics  are  required.  Code 
programming  in  Python  is  of  benefit.  However,  even  if  the  reader  is  new  to  Python, 
this  should  not  be  a  problem,  since  acquiring  the  Python  basics  is  manageable  in  a 
short  period  of  time. 


Previous  Uses  of  the  Materials 

Parts  of  the  presented  materials  have  been  used  in  the  postgraduate  course  of  Data 
Science  and  Big  Data  from  Universitat  de  Barcelona.  All  contributing  authors  are 
involved  in  this  course. 


Suggested  Uses  of  the  Book 

This  book  can  be  used  in  any  introductory  data  science  course.  The  problem-based 
approach  adopted  to  introduce  new  concepts  can  be  useful  for  the  beginners.  The 
implemented  code  solutions  for  different  problems  are  a  good  set  of  exercises  for 
the  students.  Moreover,  these  codes  can  serve  as  a  baseline  when  students  face 
bigger  projects. 


Supplemental  Resources 

This  book  is  accompanied  by  a  set  of  IPython  Notebooks  containing  all  the  codes 
necessary  to  solve  the  practical  cases  of  the  book.  The  Notebooks  can  be  found  on 
the  following  GitHub  repository:  https://github.com/DataScienceUB/introduction- 
datascience-python-book. 
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Introduction  to  Data  Science 


1 .1  What  is  Data  Science? 

You  have,  no  doubt,  already  experienced  data  science  in  several  forms.  When  you  are 
looking  for  information  on  the  web  by  using  a  search  engine  or  asking  your  mobile 
phone  for  directions,  you  are  interacting  with  data  science  products.  Data  science 
has  been  behind  resolving  some  of  our  most  common  daily  tasks  for  several  years. 

Most  of  the  scientific  methods  that  power  data  science  are  not  new  and  they  have 
been  out  there,  waiting  for  applications  to  be  developed,  for  a  long  time.  Statistics  is 
an  old  science  that  stands  on  the  shoulders  of  eighteenth-century  giants  such  as  Pierre 
Simon  Laplace  (1749-1827)  and  Thomas  Bayes  (1701-1761).  Machine  learning  is 
younger,  but  it  has  already  moved  beyond  its  infancy  and  can  be  considered  a  well- 
established  discipline.  Computer  science  changed  our  lives  several  decades  ago  and 
continues  to  do  so;  but  it  cannot  be  considered  new. 

So,  why  is  data  science  seen  as  a  novel  trend  within  business  reviews,  in  technology 
blogs,  and  at  academic  conferences? 

The  novelty  of  data  science  is  not  rooted  in  the  latest  scientific  knowledge,  but  in  a 
disruptive  change  in  our  society  that  has  been  caused  by  the  evolution  of  technology: 
datification.  Datification  is  the  process  of  rendering  into  data  aspects  of  the  world  that 
have  never  been  quantified  before.  At  the  personal  level,  the  list  of  datified  concepts 
is  very  long  and  still  growing:  business  networks,  the  lists  of  books  we  are  reading, 
the  films  we  enjoy,  the  food  we  eat,  our  physical  activity,  our  purchases,  our  driving 
behavior,  and  so  on.  Even  our  thoughts  are  datified  when  we  publish  them  on  our 
favorite  social  network;  and  in  a  not  so  distant  future,  your  gaze  could  be  datified  by 
wearable  vision  registering  devices.  At  the  business  level,  companies  are  datifying 
semi- structured  data  that  were  previously  discarded:  web  activity  logs,  computer 
network  activity,  machinery  signals,  etc.  Nonstructured  data,  such  as  written  reports, 
e-mails,  or  voice  recordings,  are  now  being  stored  not  only  for  archive  purposes  but 
also  to  be  analyzed. 
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However,  datification  is  not  the  only  ingredient  of  the  data  science  revolution.  The 
other  ingredient  is  the  democratization  of  data  analysis.  Large  companies  such  as 
Google,  Yahoo,  IBM,  or  SAS  were  the  only  players  in  this  field  when  data  science 
had  no  name.  At  the  beginning  of  the  century,  the  huge  computational  resources  of 
those  companies  allowed  them  to  take  advantage  of  datification  by  using  analytical 
techniques  to  develop  innovative  products  and  even  to  take  decisions  about  their 
own  business.  Today,  the  analytical  gap  between  those  companies  and  the  rest  of 
the  world  (companies  and  people)  is  shrinking.  Access  to  cloud  computing  allows 
any  individual  to  analyze  huge  amounts  of  data  in  short  periods  of  time.  Analytical 
knowledge  is  free  and  most  of  the  crucial  algorithms  that  are  needed  to  create  a 
solution  can  be  found,  because  open-source  development  is  the  norm  in  this  field.  As 
a  result,  the  possibility  of  using  rich  data  to  take  evidence-based  decisions  is  open 
to  virtually  any  person  or  company. 

Data  science  is  commonly  defined  as  a  methodology  by  which  actionable  insights 
can  be  inferred  from  data.  This  is  a  subtle  but  important  difference  with  respect  to 
previous  approaches  to  data  analysis,  such  as  business  intelligence  or  exploratory 
statistics.  Performing  data  science  is  a  task  with  an  ambitious  objective:  the  produc¬ 
tion  of  beliefs  informed  by  data  and  to  be  used  as  the  basis  of  decision-making.  In 
the  absence  of  data,  beliefs  are  uninformed  and  decisions,  in  the  best  of  cases,  are 
based  on  best  practices  or  intuition.  The  representation  of  complex  environments  by 
rich  data  opens  up  the  possibility  of  applying  all  the  scientific  knowledge  we  have 
regarding  how  to  infer  knowledge  from  data. 

In  general,  data  science  allows  us  to  adopt  four  different  strategies  to  explore  the 
world  using  data: 

1.  Probing  reality.  Data  can  be  gathered  by  passive  or  by  active  methods.  In  the 
latter  case,  data  represents  the  response  of  the  world  to  our  actions.  Analysis  of 
those  responses  can  be  extremely  valuable  when  it  comes  to  taking  decisions 
about  our  subsequent  actions.  One  of  the  best  examples  of  this  strategy  is  the 
use  of  A/B  testing  for  web  development:  What  is  the  best  button  size  and  color? 
The  best  answer  can  only  be  found  by  probing  the  world. 

2.  Pattern  discovery.  Divide  and  conquer  is  an  old  heuristic  used  to  solve  complex 
problems;  but  it  is  not  always  easy  to  decide  how  to  apply  this  common  sense  to 
problems.  Datified  problems  can  be  analyzed  automatically  to  discover  useful 
patterns  and  natural  clusters  that  can  greatly  simplify  their  solutions.  The  use 
of  this  technique  to  profile  users  is  a  critical  ingredient  today  in  such  important 
fields  as  programmatic  advertising  or  digital  marketing. 

3.  Predicting  future  events.  Since  the  early  days  of  statistics,  one  of  the  most  impor¬ 
tant  scientific  questions  has  been  how  to  build  robust  data  models  that  are  capa¬ 
ble  of  predicting  future  data  samples.  Predictive  analytics  allows  decisions  to 
be  taken  in  response  to  future  events,  not  only  reactively.  Of  course,  it  is  not 
possible  to  predict  the  future  in  any  environment  and  there  will  always  be  unpre¬ 
dictable  events;  but  the  identification  of  predictable  events  represents  valuable 
knowledge.  For  example,  predictive  analytics  can  be  used  to  optimize  the  tasks 
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planned  for  retail  store  staff  during  the  following  week,  by  analyzing  data  such 
as  weather,  historic  sales,  traffic  conditions,  etc. 

4.  Understanding  people  and  the  world.  This  is  an  objective  that  at  the  moment 
is  beyond  the  scope  of  most  companies  and  people,  but  large  companies  and 
governments  are  investing  considerable  amounts  of  money  in  research  areas 
such  as  understanding  natural  language,  computer  vision,  psychology  and  neu¬ 
roscience.  Scientific  understanding  of  these  areas  is  important  for  data  science 
because  in  the  end,  in  order  to  take  optimal  decisions,  it  is  necessary  to  know  the 
real  processes  that  drive  people’s  decisions  and  behavior.  The  development  of 
deep  learning  methods  for  natural  language  understanding  and  for  visual  object 
recognition  is  a  good  example  of  this  kind  of  research. 


1 .2  About  This  Book 

Data  science  is  definitely  a  cool  and  trendy  discipline  that  routinely  appears  in  the 
headlines  of  very  important  newspapers  and  on  TV  stations.  Data  scientists  are 
presented  in  those  forums  as  a  scarce  and  expensive  resource.  As  a  result  of  this 
situation,  data  science  can  be  perceived  as  a  complex  and  scary  discipline  that  is 
only  accessible  to  a  reduced  set  of  geniuses  working  for  major  companies.  The  main 
purpose  of  this  book  is  to  demystify  data  science  by  describing  a  set  of  tools  and 
techniques  that  allows  a  person  with  basic  skills  in  computer  science,  mathematics, 
and  statistics  to  perform  the  tasks  commonly  associated  with  data  science. 

To  this  end,  this  book  has  been  written  under  the  following  assumptions: 

•  Data  science  is  a  complex,  multifaceted  field  that  can  be  approached  from  sev¬ 
eral  points  of  view:  ethics,  methodology,  business  models,  how  to  deal  with  big 
data,  data  engineering,  data  governance,  etc.  Each  point  of  view  deserves  a  long 
and  interesting  discussion,  but  the  approach  adopted  in  this  book  focuses  on  ana¬ 
lytical  techniques,  because  such  techniques  constitute  the  core  toolbox  of  every 
data  scientist  and  because  they  are  the  key  ingredient  in  predicting  future  events, 
discovering  useful  patterns,  and  probing  the  world. 

•  You  have  some  experience  with  Python  programming.  For  this  reason,  we  do  not 
offer  an  introduction  to  the  language.  But  even  if  you  are  new  to  Python,  this  should 
not  be  a  problem.  Before  reading  this  book  you  should  start  with  any  online  Python 
course.  Mastering  Python  is  not  easy,  but  acquiring  the  basics  is  a  manageable  task 
for  anyone  in  a  short  period  of  time. 

•  Data  science  is  about  evidence-based  storytelling  and  this  kind  of  process  requires 
appropriate  tools.  The  Python  data  science  toolbox  is  one,  not  the  only,  of  the 
most  developed  environments  for  doing  data  science.  You  can  easily  install  all  you 
need  by  using  Anaconda1 :  a  free  product  that  includes  a  programming  language 


1  https  ://ww  w.continuum.io/downloads . 
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(Python),  an  interactive  environment  to  develop  and  present  data  science  projects 
(Jupyter  notebooks),  and  most  of  the  toolboxes  necessary  to  perform  data  analysis. 

•  Learning  by  doing  is  the  best  approach  to  learn  data  science.  For  this  reason  all  the 
code  examples  and  data  in  this  book  are  available  to  download  at  https  ://github. 
com/DataScienceUB/introduction-datascience- python- book. 

•  Data  science  deals  with  solving  real-world  problems.  So  all  the  chapters  in  the 
book  include  and  discuss  practical  cases  using  real  data. 

This  book  includes  three  different  kinds  of  chapters.  The  first  kind  is  about  Python 
extensions.  Python  was  originally  designed  to  have  a  minimum  number  of  data 
objects  (int,  float,  string,  etc.);  but  when  dealing  with  data,  it  is  necessary  to  extend 
the  native  set  to  more  complex  objects  such  as  (numpy)  numerical  arrays  or  (pandas) 
data  frames.  The  second  kind  of  chapter  includes  techniques  and  modules  to  per¬ 
form  statistical  analysis  and  machine  learning.  Finally,  there  are  some  chapters  that 
describe  several  applications  of  data  science,  such  as  building  recommenders  or  sen¬ 
timent  analysis.  The  composition  of  these  chapters  was  chosen  to  offer  a  panoramic 
view  of  the  data  science  field,  but  we  encourage  the  reader  to  delve  deeper  into  these 
topics  and  to  explore  those  topics  that  have  not  been  covered:  big  data  analytics,  deep 
learning  techniques,  and  more  advanced  mathematical  and  statistical  methods  (e.g., 
computational  algebra  and  Bayesian  statistics). 

Acknowledgements  This  chapter  was  co-written  by  Jordi  Vitria. 
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2.1  Introduction 

In  this  chapter,  first  we  introduce  some  of  the  tools  that  data  scientists  use.  The  toolbox 
of  any  data  scientist,  as  for  any  kind  of  programmer,  is  an  essential  ingredient  for 
success  and  enhanced  performance.  Choosing  the  right  tools  can  save  a  lot  of  time 
and  thereby  allow  us  to  focus  on  data  analysis. 

The  most  basic  tool  to  decide  on  is  which  programming  language  we  will  use. 
Many  people  use  only  one  programming  language  in  their  entire  life:  the  first  and 
only  one  they  learn.  For  many,  learning  a  new  language  is  an  enormous  task  that,  if 
at  all  possible,  should  be  undertaken  only  once.  The  problem  is  that  some  languages 
are  intended  for  developing  high-performance  or  production  code,  such  as  C,  C++, 
or  Java,  while  others  are  more  focused  on  prototyping  code,  among  these  the  best 
known  are  the  so-called  scripting  languages:  Ruby,  Perl,  and  Python.  So,  depending 
on  the  first  language  you  learned,  certain  tasks  will,  at  the  very  least,  be  rather  tedious. 
The  main  problem  of  being  stuck  with  a  single  language  is  that  many  basic  tools 
simply  will  not  be  available  in  it,  and  eventually  you  will  have  either  to  reimplement 
them  or  to  create  a  bridge  to  use  some  other  language  just  for  a  specific  task. 
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In  conclusion,  you  either  have  to  be  ready  to  change  to  the  best  language  for  each 
task  and  then  glue  the  results  together,  or  choose  a  very  flexible  language  with  a  rich 
ecosystem  (e.g.,  third-party  open-source  libraries).  In  this  book  we  have  selected 
Python  as  the  programming  language. 


2.2  Why  Python? 

Python  is  a  mature  programming  language  but  it  also  has  excellent  properties  for 
newbie  programmers,  making  it  ideal  for  people  who  have  never  programmed  before. 
Some  of  the  most  remarkable  of  those  properties  are  easy  to  read  code,  suppression 
of  non-mandatory  delimiters,  dynamic  typing,  and  dynamic  memory  usage.  Python 
is  an  interpreted  language,  so  the  code  is  executed  immediately  in  the  Python  con¬ 
sole  without  needing  the  compilation  step  to  machine  language.  Besides  the  Python 
console  (which  comes  included  with  any  Python  installation)  you  can  find  other  in¬ 
teractive  consoles,  such  as  IPython,1 2  which  give  you  a  richer  environment  in  which 
to  execute  your  Python  code. 

Currently,  Python  is  one  of  the  most  flexible  programming  languages.  One  of  its 
main  characteristics  that  makes  it  so  flexible  is  that  it  can  be  seen  as  a  multiparadigm 
language.  This  is  especially  useful  for  people  who  already  know  how  to  program  with 
other  languages,  as  they  can  rapidly  start  programming  with  Python  in  the  same  way. 
For  example,  Java  programmers  will  feel  comfortable  using  Python  as  it  supports 
the  object-oriented  paradigm,  or  C  programmers  could  mix  Python  and  C  code  using 
cython.  Furthermore,  for  anyone  who  is  used  to  programming  in  functional  languages 
such  as  Haskell  or  Lisp,  Python  also  has  basic  statements  for  functional  programming 
in  its  own  core  library. 

In  this  book,  we  have  decided  to  use  Python  language  because,  as  explained 
before,  it  is  a  mature  language  programming,  easy  for  the  newbies,  and  can  be  used 
as  a  specific  platform  for  data  scientists,  thanks  to  its  large  ecosystem  of  scientific 
libraries  and  its  high  and  vibrant  community.  Other  popular  alternatives  to  Python 
for  data  scientists  are  R  and  MATLAB/Octave. 


2.3  Fundamental  Python  Libraries  for  Data  Scientists 

The  Python  community  is  one  of  the  most  active  programming  communities  with  a 
huge  number  of  developed  toolboxes.  The  most  popular  Python  toolboxes  for  any 
data  scientist  are  NumPy,  SciPy,  Pandas,  and  Scikit-Learn. 


1  https  ://www.  python.org/downloads/. 

2http://ipython.org/install.html. 
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2.3.1  Numeric  and  Scientific  Computation:  NumPy  and  SciPy 

NumPy 3 4 5  is  the  cornerstone  toolbox  for  scientific  computing  with  Python.  NumPy 
provides,  among  other  things,  support  for  multidimensional  arrays  with  basic  oper¬ 
ations  on  them  and  useful  linear  algebra  functions.  Many  toolboxes  use  the  NumPy 
array  representations  as  an  efficient  basic  data  structure.  Meanwhile,  SciPy  provides 
a  collection  of  numerical  algorithms  and  domain- specific  toolboxes,  including  signal 
processing,  optimization,  statistics,  and  much  more.  Another  core  toolbox  in  SciPy 
is  the  plotting  library  Matplotlib.  This  toolbox  has  many  tools  for  data  visualization. 


2.3.2  SCIKIT-Learn:  Machine  Learning  in  Python 

Scikit-learn  is  a  machine  learning  library  built  from  NumPy,  SciPy,  and  Matplotlib. 
Scikit-learn  offers  simple  and  efficient  tools  for  common  tasks  in  data  analysis  such 
as  classification,  regression,  clustering,  dimensionality  reduction,  model  selection, 
and  preprocessing. 


2.3.3  PANDAS:  Python  Data  Analysis  Library 

Pandas'  provides  high-performance  data  structures  and  data  analysis  tools.  The  key 
feature  of  Pandas  is  a  fast  and  efficient  DataFrame  object  for  data  manipulation  with 
integrated  indexing.  The  DataFrame  structure  can  be  seen  as  a  spreadsheet  which 
offers  very  flexible  ways  of  working  with  it.  You  can  easily  transform  any  dataset  in 
the  way  you  want,  by  reshaping  it  and  adding  or  removing  columns  or  rows.  It  also 
provides  high-performance  functions  for  aggregating,  merging,  and  joining  dataset- 
s.  Pandas  also  has  tools  for  importing  and  exporting  data  from  different  formats: 
comma- separated  value  (CSV),  text  files,  Microsoft  Excel,  SQL  databases,  and  the 
fast  HDF5  format.  In  many  situations,  the  data  you  have  in  such  formats  will  not 
be  complete  or  totally  structured.  For  such  cases,  Pandas  offers  handling  of  miss¬ 
ing  data  and  intelligent  data  alignment.  Furthermore,  Pandas  provides  a  convenient 
Matplotlib  interface. 


2.4  Data  Science  Ecosystem  Installation 

Before  we  can  get  started  on  solving  our  own  data-oriented  problems,  we  will  need  to 
set  up  our  programming  environment.  The  first  question  we  need  to  answer  concerns 


3  http :  // w  w  w.  scipy.  org/scipy  lib/download  .html . 

4  http :  // w  w  w.  scipy.  org/scipy  lib/download  .html . 

5http://pandas.pydata.org/getpandas.html. 
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Python  language  itself.  There  are  currently  two  different  versions  of  Python:  Python 
2.X  and  Python  3.X.  The  differences  between  the  versions  are  important,  so  there  is 
no  compatibility  between  the  codes,  i.e.,  code  written  in  Python  2.X  does  not  work 
in  Python  3.X  and  vice  versa.  Python  3.X  was  introduced  in  late  2008;  by  then,  a  lot 
of  code  and  many  toolboxes  were  already  deployed  using  Python  2.X  (Python  2.0 
was  initially  introduced  in  2000).  Therefore,  much  of  the  scientific  community  did 
not  change  to  Python  3.0  immediately  and  they  were  stuck  with  Python  2.7.  By  now, 
almost  all  libraries  have  been  ported  to  Python  3.0;  but  Python  2.7  is  still  maintained, 
so  one  or  another  version  can  be  chosen.  However,  those  who  already  have  a  large 
amount  of  code  in  2.X  rarely  change  to  Python  3.X.  In  our  examples  throughout  this 
book  we  will  use  Python  2.7. 

Once  we  have  chosen  one  of  the  Python  versions,  the  next  thing  to  decide  is 
whether  we  want  to  install  the  data  scientist  Python  ecosystem  by  individual  tool¬ 
boxes,  or  to  perform  a  bundle  installation  with  all  the  needed  toolboxes  (and  a  lot 
more).  For  newbies,  the  second  option  is  recommended.  If  the  first  option  is  chosen, 
then  it  is  only  necessary  to  install  all  the  mentioned  toolboxes  in  the  previous  section, 
in  exactly  that  order. 

However,  if  a  bundle  installation  is  chosen,  the  Anaconda  Python  distribution6 7 8 
is  then  a  good  option.  The  Anaconda  distribution  provides  integration  of  all  the 
Python  toolboxes  and  applications  needed  for  data  scientists  into  a  single  directory 
without  mixing  it  with  other  Python  toolboxes  installed  on  the  machine.  It  contain- 
s,  of  course,  the  core  toolboxes  and  applications  such  as  NumPy,  Pandas,  SciPy, 
Matplotlib,  Scikit-learn,  IPython,  Spyder,  etc.,  but  also  more  specific  tools  for  other 
related  tasks  such  as  data  visualization,  code  optimization,  and  big  data  processing. 


2.5  Integrated  Development  Environments  (IDE) 

For  any  programmer,  and  by  extension,  for  any  data  scientist,  the  integrated  de¬ 
velopment  environment  (IDE)  is  an  essential  tool.  IDEs  are  designed  to  maximize 
programmer  productivity.  Thus,  over  the  years  this  software  has  evolved  in  order  to 
make  the  coding  task  less  complicated.  Choosing  the  right  IDE  for  each  person  is 
crucial  and,  unfortunately,  there  is  no  “one-size-fits-all”  programming  environment. 
The  best  solution  is  to  try  the  most  popular  IDEs  among  the  community  and  keep 
whichever  fits  better  in  each  case. 

In  general,  the  basic  pieces  of  any  IDE  are  three:  the  editor,  the  compiler,  (or 
interpreter)  and  the  debugger.  Some  IDEs  can  be  used  in  multiple  programming 
languages,  provided  by  language- specific  plugins,  such  as  Netbeans  or  Eclipse. 
Others  are  only  specific  for  one  language  or  even  a  specific  programming  task.  In 


6  http://continuum.io/downloads. 

7https://netbeans.org/downloads/. 

8https://eclipse.org/downloads/. 
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the  case  of  Python,  there  are  a  large  number  of  specific  IDEs,  both  commercial 
(PyCharm,9  WingIDE  1  ...)  and  open-source.  The  open-source  community  helps 
IDEs  to  spring  up,  thus  anyone  can  customize  their  own  environment  and  share  it  with 
the  rest  of  the  community.  For  example,  Spyder  (Scientific  Python  Development 
EnviRonment)  is  an  IDE  customized  with  the  task  of  the  data  scientist  in  mind. 


2.5.1  Web  Integrated  Development  Environment  (WIDE):  Jupyter 

With  the  advent  of  web  applications,  a  new  generation  of  IDEs  for  interactive  lan¬ 
guages  such  as  Python  has  been  developed.  Starting  in  the  academia  and  e-learning 
communities,  web-based  IDEs  were  developed  considering  how  not  only  your  code 
but  also  all  your  environment  and  executions  can  be  stored  in  a  server.  One  of  the 
first  applications  of  this  kind  of  WIDE  was  developed  by  William  Stein  in  early  2005 
using  Python  2.3  as  part  of  his  SageMath  mathematical  software.  In  SageMath,  a 
server  can  be  set  up  in  a  center,  such  as  a  university  or  school,  and  then  students  can 
work  on  their  homework  either  in  the  classroom  or  at  home,  starting  from  exactly  the 
same  point  they  left  off.  Moreover,  students  can  execute  all  the  previous  steps  over 
and  over  again,  and  then  change  some  particular  code  cell  (a  segment  of  the  docu¬ 
ment  that  may  content  source  code  that  can  be  executed)  and  execute  the  operation 
again.  Teachers  can  also  have  access  to  student  sessions  and  review  the  progress  or 
results  of  their  pupils. 

Nowadays,  such  sessions  are  called  notebooks  and  they  are  not  only  used  in 
classrooms  but  also  used  to  show  results  in  presentations  or  on  business  dashboards. 
The  recent  spread  of  such  notebooks  is  mainly  due  to  IPython.  Since  December  2011, 
IPython  has  been  issued  as  a  browser  version  of  its  interactive  console,  called  IPython 
notebook,  which  shows  the  Python  execution  results  very  clearly  and  concisely  by 
means  of  cells.  Cells  can  contain  content  other  than  code.  For  example,  markdown  (a 
wiki  text  language)  cells  can  be  added  to  introduce  algorithms.  It  is  also  possible  to 
insert  Matplotlib  graphics  to  illustrate  examples  or  even  web  pages.  Recently,  some 
scientific  journals  have  started  to  accept  notebooks  in  order  to  show  experimental 
results,  complete  with  their  code  and  data  sources.  In  this  way,  experiments  can 
become  completely  and  absolutely  replicable. 

Since  the  project  has  grown  so  much,  IPython  notebook  has  been  separated  from 
IPython  software  and  now  it  has  become  a  part  of  a  larger  project:  Jupyter  .  Jupyter 
(for  Julia,  Python  and  R)  aims  to  reuse  the  same  WIDE  for  all  these  interpreted 
languages  and  not  just  Python.  All  old  IPython  notebooks  are  automatically  imported 
to  the  new  version  when  they  are  opened  with  the  Jupyter  platform;  but  once  they 


9  https :// w  w  w.j  etbrains .  com/py  charm/ . 

10https:// wingware.com/. 

1 1  https://github.com/spyder-ide/spyder. 
12http://jupyter.readthedocs.org/en/latest/install.html. 
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are  converted  to  the  new  version,  they  cannot  be  used  again  in  old  IPython  notebook 
versions. 

In  this  book,  all  the  examples  shown  use  Jupyter  notebook  style. 


2.6  Get  Started  with  Python  for  Data  Scientists 

Throughout  this  book,  we  will  come  across  many  practical  examples.  In  this  chapter, 
we  will  see  a  very  basic  example  to  help  get  started  with  a  data  science  ecosystem 
from  scratch.  To  execute  our  examples,  we  will  use  Jupyter  notebook,  although  any 
other  console  or  IDE  can  be  used. 


The  Jupyter  Notebook  Environment 


Once  all  the  ecosystem  is  fully  installed,  we  can  start  by  launching  the  Jupyter 
notebook  platform.  This  can  be  done  directly  by  typing  the  following  command  on 
your  terminal  or  command  line:  $  jupyter  notebook 

If  we  chose  the  bundle  installation,  we  can  start  the  Jupyter  notebook  platform  by 
clicking  on  the  Jupyter  Notebook  icon  installed  by  Anaconda  in  the  start  menu  or  on 
the  desktop. 

The  browser  will  immediately  be  launched  displaying  the  Jupyter  notebook  home- 
page,  whose  URL  is  http://localhost:8888/tree.  Note  that  a  special  port  is  used;  by 
default  it  is  8888.  As  can  be  seen  in  Fig.  2.1,  this  initial  page  displays  a  tree  view  of  a 
directory.  If  we  use  the  command  line,  the  root  directory  is  the  same  directory  where 
we  launched  the  Jupyter  notebook.  Otherwise,  if  we  use  the  Anaconda  launcher,  the 
root  directory  is  the  current  user  directory.  Now,  to  start  a  new  notebook,  we  only 


need  to  press  the  |  New  Notebooks ))  Python^]  button  at  the  top  on  the  right  of  the 
home  page. 

As  can  be  seen  in  Fig. 2.2,  a  blank  notebook  is  created  called  Untitled. 
First  of  all,  we  are  going  to  change  the  name  of  the  notebook  to  something 
more  appropriate.  To  do  this,  just  click  on  the  notebook  name  and  rename  it: 
DataScience-GetStartedExample. 

Let  us  begin  by  importing  those  toolboxes  that  we  will  need  for  our  program.  In  the 
first  cell  we  put  the  code  to  import  the  Pandas  library  as  pd.  This  is  for  convenience; 
every  time  we  need  to  use  some  functionality  from  the  Pandas  library,  we  will  write 
pd  instead  of  pandas.  We  will  also  import  the  two  core  libraries  mentioned  above: 
the  numpy  library  as  np  and  the  matplotlib  library  as  pit. 


\ 

imp  o  r  t 

pandas  as  pd 

imp  o  r  t 

numpy  as  np 

imp  o  r  t 

ma t p 1 o t 1 ib . pyp 1 o t  as  pit 

v 

_ y 

In  []  : 
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£  jupyter 


hies  Running  Clusters 

Setset  items  to  parior m  actions  on  t!wn,  Uoioad  * 

-  * 

□  #  Pandas' BookJpyi’fo 


Fig.  2.1  IPython  notebook  home  page,  displaying  a  home  tree  directory 


*»* 


e 


jupyter  untied 


JjlL.-ijvM, 


Fla  Edit  View  ln»Pl  Can  KamtH  Halp  /  Python  2  O 

K  +  »:(£]£  +  +  KBC  Code  i  Cell  Toolbar;  Wona  | 


I"  [  1J  I 


r 


Fig.  2.2  An  empty  new  notebook 


To  execute  just  one  cell,  we  press  the  N  button  or  click  on  [  Cell  ))~Run  or  press 
the  keys  [Ctrl  [+[ Enter .  While  execution  is  underway,  the  header  of  the  cell  shows  the 
*  mark: 


f 

\ 

imp  o  r  t 

pandas  as  pd 

imp  o  r  t 

numpy  as  np 

imp  o  r  t 

ma t p 1 o t 1 ib . pyp 1 o t  as  pit 

A 

_ y 

In  [*]  : 
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While  a  cell  is  being  executed,  no  other  cell  can  be  executed.  If  you  try  to  execute 
another  cell,  its  execution  will  not  start  until  the  first  cell  has  finished  its  execution. 

Once  the  execution  is  finished,  the  header  of  the  cell  will  be  replaced  by  the  next 
number  of  execution.  Since  this  will  be  the  first  cell  executed,  the  number  shown  will 
be  1 .  If  the  process  of  importing  the  libraries  is  correct,  no  output  cell  is  produced. 


\ 

import 

pandas  as  pd 

imp  o  r  t 

numpy  as  np 

imp  o  r  t 

ma t p 1 o t 1 ib . pyp 1 o t  as  pit 

_ 

_ / 

For  simplicity,  other  chapters  in  this  book  will  avoid  writing  these  imports. 


The  DataFrame  Data  Structure 


In  [2]  : 


The  key  data  structure  in  Pandas  is  the  DataFrame  object.  A  DataFrame  is  basically 
a  tabular  data  structure,  with  rows  and  columns.  Rows  have  a  specific  index  to  access 
them,  which  can  be  any  name  or  value.  In  Pandas,  the  columns  are  called  Series, 
a  special  type  of  data,  which  in  essence  consists  of  a  list  of  several  values,  where 
each  value  has  an  index.  Therefore,  the  DataFrame  data  structure  can  be  seen  as  a 
spreadsheet,  but  it  is  much  more  flexible.  To  understand  how  it  works,  let  us  see 
how  to  create  a  DataFrame  from  a  common  Python  dictionary  of  lists.  First,  we  will 
create  a  new  cell  by  clicking  [insert})  Insert  Cell  Below  or  pressing  the  keys  [ Ctrl ]+[~B~. 
Then,  we  write  in  the  following  code: 


data  =  {  ' year  '  :  [ 

2010  ,  2011  ,  2  012  , 

2010  ,  2011  ,  2  012  , 

2010,  2011,  2012 

]  , 

'  team  '  :  [ 

' FCBarcelona  '  ,  '  FCBarcelona  '  , 

'  FCBarcelona  '  ,  ' RMadri d  '  , 

' RMadrid  '  ,  ' RMadrid  '  , 

' ValenciaCF  '  ,  ' ValenciaCF  '  , 

' Va 1 enc iaCF  ' 


]  , 


'wins ' : 

[30 

to 

00 

32  , 

to 

32  , 

2  6  , 

21 ,  17  ,  19 ]  , 

' draws  '  : 

[6  , 

7  ,  4  , 

5  , 

4  , 

7  , 

8  , 

10  , 

8]  , 

'  losses  '  : 

[2  , 

3  ,  2  , 

4  , 

2  , 

5  , 

9  , 

11  , 

11  ] 

} 

football  =  pd . Da taFrame ( data ,  columns  =  [ 

'year',  'team',  'wins',  'draws',  'losses' 
] 

) 


v 


y 


In  this  example,  we  use  the  pandas  DataFrame  object  constructor  with  a  dictionary 
of  lists  as  argument.  The  value  of  each  entry  in  the  dictionary  is  the  name  of  the 
column,  and  the  lists  are  their  values. 

The  DataFrame  columns  can  be  arranged  at  construction  time  by  entering  a  key¬ 
word  columns  with  a  list  of  the  names  of  the  columns  ordered  as  we  want.  If  the 
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Out  [  2  ]  : 


column  keyword  is  not  present  in  the  constructor,  the  columns  will  be  arranged  in 
alphabetical  order.  Now,  if  we  execute  this  cell,  the  result  will  be  a  table  like  this: 


year 

team 

wins 

draws 

losses 

0 

2010 

FCBarcelona 

30 

6 

2 

1 

2011 

FCBarcelona 

28 

7 

3 

2 

2012 

FCBarcelona 

32 

4 

2 

3 

2010 

RMadrid 

29 

5 

4 

4 

2011 

RMadrid 

32 

4 

2 

5 

2012 

RMadrid 

26 

7 

5 

6 

2010 

ValenciaCF 

21 

8 

9 

7 

2011 

ValenciaCF 

17 

10 

11 

8 

2012 

ValenciaCF 

19 

8 

11 

where  each  entry  in  the  dictionary  is  a  column.  The  index  of  each  row  is  created 
automatically  taking  the  position  of  its  elements  inside  the  entry  lists,  starting  from  0. 
Although  it  is  very  easy  to  create  DataFrames  from  scratch,  most  of  the  time  what 
we  will  need  to  do  is  import  chunks  of  data  into  a  DataFrame  structure,  and  we  will 
see  how  to  do  this  in  later  examples. 

Apart  from  DataFrame  data  structure  creation,  Panda  offers  a  lot  of  functions 
to  manipulate  them.  Among  other  things,  it  offers  us  functions  for  aggregation, 
manipulation,  and  transformation  of  the  data.  In  the  following  sections,  we  will 
introduce  some  of  these  functions. 


Open  Government  Data  Analysis  Example  Using  Pandas 

To  illustrate  how  we  can  use  Pandas  in  a  simple  real  problem,  we  will  start  doing 
some  basic  analysis  of  government  data.  For  the  sake  of  transparency,  data  produced 
by  government  entities  must  be  open,  meaning  that  they  can  be  freely  used,  reused, 
and  distributed  by  anyone.  An  example  of  this  is  the  Eurostat,  which  is  the  home  of 
European  Commission  data.  Eurostat’s  main  role  is  to  process  and  publish  compa¬ 
rable  statistical  information  at  the  European  level.  The  data  in  Eurostat  are  provided 
by  each  member  state  and  it  is  free  to  reuse  them,  for  both  noncommercial  and 
commercial  purposes  (with  some  minor  exceptions). 

Since  the  amount  of  data  in  the  Eurostat  database  is  huge,  in  our  first  study  we 
are  only  going  to  focus  on  data  relative  to  indicators  of  educational  funding  by  the 
member  states.  Thus,  the  first  thing  to  do  is  to  retrieve  such  data  from  Eurostat. 
Since  open  data  have  to  be  delivered  in  a  plain  text  format,  CSV  (or  any  other 
delimiter- separated  value)  formats  are  commonly  used  to  store  tabular  data.  In  a 
delimiter- separated  value  file,  each  line  is  a  data  record  and  each  record  consist- 
s  of  one  or  more  fields,  separated  by  the  delimiter  character  (usually  a  comma). 
Therefore,  the  data  we  will  use  can  be  found  already  processed  at  book’s  Github 
repository  as  educ_f  igdp_l_Data  .  csvfile.  Of  course,  it  can  also  be  download¬ 
ed  as  unprocessed  tabular  data  from  the  Eurostat  database  site  3  following  the  path: 
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In  [1]  : 


Out  [  1  ]  : 


Tables  by  themes  ))Population  and  social  conditions)))  Education  and  training)))  Education 
Indicators  on  education  finance  ))Public  expenditure  on  education  . 


2.6.1  Reading 

Let  us  start  reading  the  data  we  downloaded.  First  of  all,  we  have  to  create  a  new 
notebook  called  Open  Government  Data  Analysis  and  open  it.  Then,  after 
ensuring  that  the  educ_f  igdp_l_Data  .  csv  file  is  stored  in  the  same  directory 
as  our  notebook  directory,  we  will  write  the  following  code  to  read  and  show  the 
content: 


r 

edu 

=  pd . read_csv ( ' files/ch02 /educ_f igdp_l_Data . csv ' , 

\ 

na_values  =  '  :  '  , 

usecols  =  [ "TIME" , "GEO" , "Value" ] ) 

edu 

V 

_ y 

TIME 

GEO 

Value 

0 

2000 

European  Union  . . . 

NaN 

1 

2001 

European  Union  . . . 

NaN 

2 

2002 

European  Union  . . . 

5.00 

3 

2003 

European  Union  . . . 

5.03 

.  .  . 

.  .  . 

.  .  . 

.  .  . 

382 

2010 

Finland 

6.85 

383 

2011 

Finland 

6.76 

384  rows  x  5  columns 


The  way  to  read  CSV  (or  any  other  separated  value,  providing  the  separator 
character)  files  in  Pandas  is  by  calling  the  read_csv  method.  Besides  the  name 
of  the  file,  we  add  the  na_values  key  argument  to  this  method  along  with  the 
character  that  represents  “non  available  data”  in  the  file.  Normally,  CSV  files  have  a 
header  with  the  names  of  the  columns.  If  this  is  the  case,  we  can  use  the  usecols 
parameter  to  select  which  columns  in  the  file  will  be  used. 

In  this  case,  the  DataFrame  resulting  from  reading  our  data  is  stored  in  edu.  The 
output  of  the  execution  shows  that  the  edu  DataFrame  size  is  384  rows  x  3  columns. 
Since  the  DataFrame  is  too  large  to  be  fully  displayed,  three  dots  appear  in  the  middle 
of  each  row. 

Beside  this,  Pandas  also  has  functions  for  reading  files  with  formats  such  as  Excel, 
HDF5,  tabulated  files,  or  even  the  content  from  the  clipboard  (read_excel  ( ) , 
read_hdf  ( ) ,  read_table  ( ) ,  read_clipboard  ( ) ).  Whichever  function 
we  use,  the  result  of  reading  a  file  is  stored  as  a  DataFrame  structure. 

To  see  how  the  data  looks,  we  can  use  the  head  ( )  method,  which  shows  just  the 
first  five  rows.  If  we  use  a  number  as  an  argument  to  this  method,  this  will  be  the 
number  of  rows  that  will  be  listed: 


13http://ec.europa.eu/eurostat/data/database. 
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In  [2]  : 

f 

N 

edu . head  (  ) 

_ / 

TIME 

GEO 

Value 

0 

2000 

European 

Union  . . . 

NaN 

1 

2001 

European 

Union  . . . 

NaN 

2 

2002 

European 

Union  . . . 

5.00 

3 

2003 

European 

Union  . . . 

5.03 

4 

2004 

European 

Union  . . . 

4.95 

Similarly,  it  exists  the  t  a  i  1  ( )  method,  which  returns  the  last  five  rows  by  default. 


[3]  : 

f 

edu .  tail  (  ) 

_ 

379 

2007 

Finland 

5.90 

380 

2008 

Finland 

6.10 

381 

2009 

Finland 

6.81 

382 

2010 

Finland 

6.85 

383 

2011 

Finland 

6.76 

If  we  want  to  know  the  names  of  the  columns  or  the  names  of  the  indexes,  we 
can  use  the  DataFrame  attributes  columns  and  index  respectively.  The  names  of 
the  columns  or  indexes  can  be  changed  by  assigning  a  new  list  of  the  same  length  to 
these  attributes.  The  values  of  any  DataFrame  can  be  retrieved  as  a  Python  array  by 
calling  its  values  attribute. 

If  we  just  want  quick  statistical  information  on  all  the  numeric  columns  in  a 
DataFrame,  we  can  use  the  function  describe  ( ) .  The  result  shows  the  count,  the 
mean,  the  standard  deviation,  the  minimum  and  maximum,  and  the  percentiles,  by 
default,  the  25th,  50th,  and  75th,  for  all  the  values  in  each  column  or  series. 


In  [4]  : 

r 

N 

edu . de s c r ibe  (  ) 

_ y 

TIME 

Value 

count 

384 .000000 

361.000000 

mean 

2005.500000 

5.203989 

std 

3.456556 

1.021694 

min 

2000.000000 

2.880000 

25% 

2002.750000 

4.620000 

50% 

2005.500000 

5.060000 

75% 

2008.250000 

5.660000 

max 

2011.000000 

8.810000 

Name:  Value,  dtype :  float64 
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In  [  5  ]  : 


Out  [  5  ]  : 


In  [  6  ]  : 


Out  [  6  ]  : 


2.6.2  Selecting  Data 

If  we  want  to  select  a  subset  of  data  from  a  DataFrame,  it  is  necessary  to  indicate  this 
subset  using  square  brackets  ( [  ] )  after  the  DataFrame.  The  subset  can  be  specified 
in  several  ways.  If  we  want  to  select  only  one  column  from  a  DataFrame,  we  only 
need  to  put  its  name  between  the  square  brackets.  The  result  will  be  a  Series  data 
structure,  not  a  DataFrame,  because  only  one  column  is  retrieved. 

f  \ 

edu [ ' Value ' ] 

\ _ / 


0  NaN 

1  NaN 

2  5.00 

3  5.03 

4  4.95 

380  6.10 

381  6.81 

382  6.85 

383  6.76 

Name:  Value,  dtype :  float64 


If  we  want  to  select  a  subset  of  rows  from  a  DataFrame,  we  can  do  so  by  indicating 
a  range  of  rows  separated  by  a  colon  ( : )  inside  the  square  brackets.  This  is  commonly 
known  as  a  slice  of  rows: 

f  \ 

edu  [10:14] 

\ _ / 


TIME 

GEO 

Value 

10 

2010 

European  Union  (28  countries) 

5.41 

11 

2011 

European  Union  (28  countries) 

5.25 

12 

2000 

European  Union  (27  countries) 

4.91 

13 

2001 

European  Union  (27  countries) 

4.99 

This  instruction  returns  the  slice  of  rows  from  the  10th  to  the  13th  position.  Note 
that  the  slice  does  not  use  the  index  labels  as  references,  but  the  position.  In  this  case, 
the  labels  of  the  rows  simply  coincide  with  the  position  of  the  rows. 

If  we  want  to  select  a  subset  of  columns  and  rows  using  the  labels  as  our  references 
instead  of  the  positions,  we  can  use  ix  indexing: 

c  \ 

edu  .  ix  [  9  0  :  9  4  ,  [  '  T IME  '  ,  '  GEO  '  ]  ] 

\ _ / 


In  [7]  : 
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Out  [  7  ]  : 


In  [8]  : 


Out  [  8  ]  : 


TIME 

GEO 

90 

2006 

Belgium 

91 

2007 

Belgium 

92 

2008 

Belgium 

93 

2009 

Belgium 

94 

2010 

Belgium 

This  returns  all  the  rows  between  the  indexes  specified  in  the  slice  before  the 
comma,  and  the  columns  specified  as  a  list  after  the  comma.  In  this  case,  ix  references 
the  index  labels,  which  means  that  ix  does  not  return  the  90th  to  94th  rows,  but  it 
returns  all  the  rows  between  the  row  labeled  90  and  the  row  labeled  94;  thus  if  the 
index  100  is  placed  between  the  rows  labeled  as  90  and  94,  this  row  would  also  be 
returned. 


2.6.3  Filtering  Data 

Another  way  to  select  a  subset  of  data  is  by  applying  Boolean  indexing.  This  indexing 
is  commonly  known  as  a  filter.  For  instance,  if  we  want  to  filter  those  values  less 

than  or  equal  to  6.5,  we  can  do  it  like  this: 

/ - \ 

edu  [  edu  ['Value']  >  6. 5].  tail  () 

k _ y 


TIME 

GEO 

Value 

218 

2002 

Cyprus 

6.60 

281 

2005 

Malta 

6.58 

94 

2010 

Belgium 

6.58 

93 

2009 

Belgium 

6.57 

95 

2011 

Belgium 

6.55 

Boolean  indexing  uses  the  result  of  a  Boolean  operation  over  the  data,  returning 
a  mask  with  True  or  False  for  each  row.  The  rows  marked  True  in  the  mask  will 
be  selected.  In  the  previous  example,  the  Boolean  operation  edu  [  'Value '  ]  > 
6 . 5  produces  a  Boolean  mask.  When  an  element  in  the  “Value”  column  is  greater 
than  6.5,  the  corresponding  value  in  the  mask  is  set  to  True,  otherwise  it  is  set  to 
False.  Then,  when  this  mask  is  applied  as  an  index  in  edu  [edu  [  'Value '  ]  > 
6.5],  the  result  is  a  filtered  DataFrame  containing  only  rows  with  values  higher 
than  6.5.  Of  course,  any  of  the  usual  Boolean  operators  can  be  used  for  filtering:  < 
(less  than),<=  (less  than  or  equal  to),  >  (greater  than),  >=  (greater  than  or  equal 
to),  =  (equal  to),  and  !  =  (not  equal  to). 


2.6.4  Filtering  Missing  Values 

Pandas  uses  the  special  value  NaN  (not  a  number)  to  represent  missing  values.  In 
Python,  NaN  is  a  special  floating-point  value  returned  by  certain  operations  when 
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In  [  9  ]  : 


Out  [  9  ]  : 


Table  2.1  List  of  most  common  aggregation  functions 


Function 

Description 

count() 

Number  of  non-null  observations 

sum() 

Sum  of  values 

mean() 

Mean  of  values 

median() 

Arithmetic  median  of  values 

min() 

Minimum 

max() 

Maximum 

prod() 

Product  of  values 

std() 

Unbiased  standard  deviation 

var() 

Unbiased  variance 

one  of  their  results  ends  in  an  undefined  value.  A  subtle  feature  of  NaN  values  is  that 
two  NaN  are  never  equal.  Because  of  this,  the  only  safe  way  to  tell  whether  a  value  is 
missing  in  a  DataFrame  is  by  using  the  isnull  ( )  function.  Indeed,  this  function 

can  be  used  to  filter  rows  with  missing  values: 

/ - \ 

edu  [  edu  ["Value"  ]  .  isnull  ()  ]  .  head  (  ) 

V _ 2 


TIME 

GEO 

Value 

0 

2000 

European  Union  (28  countries) 

NaN 

1 

2001 

European  Union  (28  countries) 

NaN 

36 

2000 

Euro  area  (18  countries) 

NaN 

37 

2001 

Euro  area  (18  countries) 

NaN 

48 

2000 

Euro  area  (17  countries) 

NaN 

2.6.5  Manipulating  Data 

Once  we  know  how  to  select  the  desired  data,  the  next  thing  we  need  to  know  is  how 
to  manipulate  data.  One  of  the  most  straightforward  things  we  can  do  is  to  operate 
with  columns  or  rows  using  aggregation  functions.  Table  2.1  shows  a  list  of  the  most 
common  aggregation  functions.  The  result  of  all  these  functions  applied  to  a  row  or 
column  is  always  a  number.  Meanwhile,  if  a  function  is  applied  to  a  DataFrame  or  a 
selection  of  rows  and  columns,  then  you  can  specify  if  the  function  should  be  applied 
to  the  rows  for  each  column  (setting  the  axis  =  0  keyword  on  the  invocation  of  the 
function),  or  it  should  be  applied  on  the  columns  for  each  row  (setting  the  axis  =  l 
keyword  on  the  invocation  of  the  function). 

s  \ 

edu.max(axis  =  0) 

\ _ / 


In  [10] : 
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Out [ 10 ] 


In  [11] : 


Out [11] 


In  [12] : 


Out [ 12 ] 


In  [13] : 


Out [ 13 ] 


TIME  2011 

GEO  Spain 

Value  8.81 

dtype :  object 

Note  that  these  are  functions  specific  to  Pandas,  not  the  generic  Python  functions. 
There  are  differences  in  their  implementation.  In  Python,  NaN  values  propagate 
through  all  operations  without  raising  an  exception.  In  contrast,  Pandas  operations 
exclude  NaN  values  representing  missing  data.  For  example,  the  pandas  max  function 
excludes  NaN  values,  thus  they  are  interpreted  as  missing  values,  while  the  standard 
Python  max  function  will  take  the  mathematical  interpretation  of  NaN  and  return  it 
as  the  maximum: 


r 

\ 

print 

" Pandas  max 

function: " , 

edu [ ' Value ' ] . max ( ) 

print 

" Python  max 

function:  "  , 

max ( edu [ ' Value ' ] ) 

_ y 

Pandas  max  function:  8.81 
Python  max  function:  nan 

Beside  these  aggregation  functions,  we  can  apply  operations  over  all  the  values  in 
rows,  columns  or  a  selection  of  both.  The  rule  of  thumb  is  that  an  operation  between 
columns  means  that  it  is  applied  to  each  row  in  that  column  and  an  operation  between 
rows  means  that  it  is  applied  to  each  column  in  that  row.  For  example  we  can  apply 
any  binary  arithmetical  operation  (+,-,*,/)  to  an  entire  row: 


y 

\ 

s  =  edu [ "Value " ] /100 

s . head ( ) 

\ _ 

_ y 

0  NaN 

1  NaN 

2  0.0500 

3  0.0503 

4  0.0495 

Name:  Value,  dtype:  float64 

However,  we  can  apply  any  function  to  a  DataFrame  or  Series  just  setting  its  name 
as  argument  of  the  apply  method.  For  example,  in  the  following  code,  we  apply 
the  sqrt  function  from  the  NumPy  library  to  perform  the  square  root  of  each  value 
in  the  Value  column. 


y 

S 

S  . 

V 

=  edu  [  "Value"  ]  .  apply  (  np  .  sqrt  ) 
head  (  ) 

N 

_ 

0 

NaN 

1 

NaN 

2 

2.236068 

3 

2.242766 

4 

2.224860 

Name : 

Value,  dtype:  float64 
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In  [14] : 


Out [14] 


In  [15] : 


Out  [  15 ] 


If  we  need  to  design  a  specific  function  to  apply  it,  we  can  write  an  in-line  function, 
commonly  known  as  a  A -function.  A  A -function  is  a  function  without  a  name.  It  is 
only  necessary  to  specify  the  parameters  it  receives,  between  the  lambda  keyword 
and  the  colon  ( : ).  In  the  next  example,  only  one  parameter  is  needed,  which  will  be 
the  value  of  each  element  in  the  Value  column.  The  value  the  function  returns  will 
be  the  square  of  that  value. 

/  \ 

s  =  edu  [  "Value"  ]  .  apply ( 1 a mb da  d :  d  *  *  2 ) 
s . head ( ) 

\ _ / 


0  NaN 

1  NaN 

2  25.0000 

3  25.3009 

4  24.5025 

Name:  Value,  dtype :  float64 

Another  basic  manipulation  operation  is  to  set  new  values  in  our  DataFrame.  This 
can  be  done  directly  using  the  assign  operator  (=)  over  a  DataFrame.  For  example,  to 
add  a  new  column  to  a  DataFrame,  we  can  assign  a  Series  to  a  selection  of  a  column 
that  does  not  exist.  This  will  produce  a  new  column  in  the  DataFrame  after  all  the 
others.  You  must  be  aware  that  if  a  column  with  the  same  name  already  exists,  the 
previous  values  will  be  overwritten.  In  the  following  example,  we  assign  the  Series 
that  results  from  dividing  the  Value  column  by  the  maximum  value  in  the  same 
column  to  a  new  column  named  ValueNorm. 


/ 

\ 

edu [  ' ValueNorm  '  ] 

=  edu [ 'Value']/ edu [ 'Value']. max () 

edu .tail  (  ) 

V 

_ y 

TIME 

GEO 

Value 

ValueNorm 

379 

2007 

Finland 

5.90 

0.669694 

380 

2008 

Finland 

6.10 

0.692395 

381 

2009 

Finland 

6.81 

0.772985 

382 

2010 

Finland 

6.85 

0 . 777526 

383 

2011 

Finland 

6.76 

0 .767310 

Now,  if  we  want  to  remove  this  column  from  the  DataFrame,  we  can  use  the  drop 
function;  this  removes  the  indicated  rows  if  axis  =  0,  or  the  indicated  columns  if 
axis  =  l.  In  Pandas,  all  the  functions  that  change  the  contents  of  a  DataFrame,  such 
as  the  drop  function,  will  normally  return  a  copy  of  the  modified  data,  instead  of 
overwriting  the  DataFrame.  Therefore,  the  original  DataFrame  is  kept.  If  you  do  not 
want  to  keep  the  old  values,  you  can  set  the  keyword  inplace  to  True.  By  default, 
this  keyword  is  set  to  False,  meaning  that  a  copy  of  the  data  is  returned. 


edu . drop ( ' ValueNorm' ,  axis  =  1,  inplace  =  True) 

edu . head  (  ) 

v 

_ 

In  [16] : 
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Out [ 16 ] 


In  [17] : 


Out [ 17 ] 


In  [18] : 


Out  [  18 ] 


TIME 

GEO 

Value 

0 

2000 

European  Union  (28  countries) 

NaN 

1 

2001 

European  Union  (28  countries) 

NaN 

2 

2002 

European  Union  (28  countries) 

5 

3 

2003 

European  Union  (28  countries) 

5.03 

4 

2004 

European  Union  (28  countries) 

4.95 

Instead,  if  what  we  want  to  do  is  to  insert  a  new  row  at  the  bottom  of  the  DataFrame, 
we  can  use  the  Pandas  append  function.  This  function  receives  as  argument 
the  new  row,  which  is  represented  as  a  dictionary  where  the  keys  are  the  name 
of  the  columns  and  the  values  are  the  associated  value.  You  must  be  aware  to  setting 
the  ignore_index  flag  in  the  append  method  to  True,  otherwise  the  index  0 
is  given  to  this  new  row,  which  will  produce  an  error  if  it  already  exists: 


edu  =  edu . append ( { " TIME " :  2000, 

" Value  "  :  5.00," GEO  "  : 

'  a  '  }  , 

\ 

ignore_index  = 

True  ) 

edu . t a i 1  (  ) 

v 

_ y 

TIME 

GEO 

Value 

380 

2008 

Finland 

6.1 

381 

2009 

Finland 

6.81 

382 

2010 

Finland 

6.85 

383 

2011 

Finland 

6.76 

384 

2000 

a 

5 

Finally,  if  we  want  to  remove  this  row,  we  need  to  use  the  drop  function  again. 
Now  we  have  to  set  the  axis  to  0,  and  specify  the  index  of  the  row  we  want  to  remove. 
Since  we  want  to  remove  the  last  row,  we  can  use  the  max  function  over  the  indexes 
to  determine  which  row  is. 


X 

edu  .  drop  (max  (  edu  .index),  axis  =  0,  i  np  lace  =  True) 

edu . t a i 1  (  ) 

_ y 

TIME 

GEO 

Value 

379 

2007 

Finland 

5.9 

380 

2008 

Finland 

6.1 

381 

2009 

Finland 

6.81 

382 

2010 

Finland 

6.85 

383 

2011 

Finland 

6.76 

The  drop  ( )  function  is  also  used  to  remove  missing  values  by  applying  it  over 
the  result  of  the  isnull  ( )  function.  This  has  a  similar  effect  to  filtering  the  NaN 
values,  as  we  explained  above,  but  here  the  difference  is  that  a  copy  of  the  DataFrame 
without  the  NaN  values  is  returned,  instead  of  a  view. 


r 

N 

eduDrop  =  edu  .  drop  (  edu  [  "Value  "  ]  .  isnull  (  )  ,  axis  =  0) 

eduDrop . head ( ) 

v 

_ y 

In  [19] : 
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Out [ 19 ] 


In  [20]  : 


Out [20] 


In  [21]  : 


Out [21] 


TIME 

GEO 

Value 

2 

2002 

European  Union  (28  countries) 

5.00 

3 

2003 

European  Union  (28  countries) 

5.03 

4 

2004 

European  Union  (28  countries) 

4.95 

5 

2005 

European  Union  (28  countries) 

4.92 

6 

2006 

European  Union  (28  countries) 

4.91 

To  remove  NaN  values,  instead  of  the  generic  drop  function,  we  can  use  the  specific 
dropna  ( )  function.  If  we  want  to  erase  any  row  that  contains  an  NaN  value,  we 
have  to  set  the  how  keyword  to  any.  To  restrict  it  to  a  subset  of  columns,  we  can 
specify  it  using  the  subset  keyword.  As  we  can  see  below,  the  result  will  be  the 
same  as  using  the  drop  function: 

f  \ 

eduDrop  =  edu . dropna (how  =  ' any subset  =  ["Value"]) 

eduDrop . head ( ) 

\ _ / 


TIME 

GEO 

Value 

2 

2002 

European  Union  (28  countries) 

5.00 

3 

2003 

European  Union  (28  countries) 

5.03 

4 

2004 

European  Union  (28  countries) 

4.95 

5 

2005 

European  Union  (28  countries) 

4.92 

6 

2006 

European  Union  (28  countries) 

4.91 

If,  instead  of  removing  the  rows  containing  NaN,  we  want  to  fill  them  with  another 
value,  then  we  can  use  the  f  illna  ( )  method,  specifying  which  value  has  to  be 
used.  If  we  want  to  fill  only  some  specific  columns,  we  have  to  set  as  argument  to 
the  f  illna  ( )  function  a  dictionary  with  the  name  of  the  columns  as  the  key  and 
which  character  to  be  used  for  filling  as  the  value. 


\ 

eduFilled  =  edu . f i 1 lna ( value  =  ("Value":  0}) 

eduFilled . head ( ) 

V 

_ 

TIME 

GEO 

Value 

0 

2000 

European  Union  (28  countries) 

0.00 

1 

2001 

European  Union  (28  countries) 

0.00 

2 

2002 

European  Union  (28  countries) 

5.00 

3 

2003 

European  Union  (28  countries) 

4.95 

4 

2004 

European  Union  (28  countries) 

4.95 

2.6.6  Sorting 

Another  important  functionality  we  will  need  when  inspecting  our  data  is  to  sort  by 
columns.  We  can  sort  a  DataFrame  using  any  column,  using  the  sort  function.  If 
we  want  to  see  the  first  five  rows  of  data  sorted  in  descending  order  (i.e.,  from  the 
largest  to  the  smallest  values)  and  using  the  Value  column,  then  we  just  need  to  do 
this: 
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In  [22]  : 


Out [ 22 ] 


In  [23]  : 


Out [ 23 ] 


edu . sort_values (by  =  'Value',  ascending  =  False, 

\ 

inplace  =  True) 

edu . head  (  ) 

V 

_ y 

TIME 

GEO 

Value 

130 

2010 

Denmark 

8.81 

131 

2011 

Denmark 

8.75 

129 

2009 

Denmark 

8.74 

121 

2001 

Denmark 

8.44 

122 

2002 

Denmark 

8.44 

Note  that  the  inplace  keyword  means  that  the  DataFrame  will  be  overwritten, 
and  hence  no  new  DataFrame  is  returned.  If  instead  of  ascending  =  False  we 
use  ascending  =  True,  the  values  are  sorted  in  ascending  order  (i.e.,  from  the 
smallest  to  the  largest  values). 

If  we  want  to  return  to  the  original  order,  we  can  sort  by  an  index  using  the 
sort_index  function  and  specifying  axis  =  0: 


/ - N 

edu . sort_index ( axis  =  0,  ascending  =  True,  inplace  =  True) 

edu . head  (  ) 

v _ y 


TIME 

GEO 

Value 

0 

2000 

European 

Union  . . . 

NaN 

1 

2001 

European 

Union  . . . 

NaN 

2 

2002 

European 

Union  . . . 

5.00 

3 

2003 

European 

Union  . . . 

5.03 

4 

2004 

European 

Union  . . . 

4.95 

2.6.7  Grouping  Data 

Another  very  useful  way  to  inspect  data  is  to  group  it  according  to  some  criteria.  For 
instance,  in  our  example  it  would  be  nice  to  group  all  the  data  by  country,  regardless 
of  the  year.  Pandas  has  the  groupby  function  that  allows  us  to  do  exactly  this.  The 
value  returned  by  this  function  is  a  special  grouped  DataFrame.  To  have  a  proper 
DataFrame  as  a  result,  it  is  necessary  to  apply  an  aggregation  function.  Thus,  this 
function  will  be  applied  to  all  the  values  in  the  same  group. 

For  example,  in  our  case,  if  we  want  a  DataFrame  showing  the  mean  of  the  values 
for  each  country  over  all  the  years,  we  can  obtain  it  by  grouping  according  to  country 
and  using  the  mean  function  as  the  aggregation  method  for  each  group.  The  result 
would  be  a  DataFrame  with  countries  as  indexes  and  the  mean  values  as  the  column: 


\ 

group  =  edu  [  [  "  GEO  "  ,  "  Value  "  ] 

. groupby ( ' GEO ' 

. mean  (  ) 

group . head ( ) 

V 

_ y 

In  [24]  : 
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Out [ 24 ] 


In  [25]  : 


Out [25] 


In  [26]  : 


Out [26] 


Value 

GEO 

Austria 

5.618333 

Belgium 

6.189091 

Bulgaria 

4.093333 

Cyprus 

7 . 023333 

Czech  Republic 

4.16833 

2.6.8  Rearranging  Data 

Up  until  now,  our  indexes  have  been  just  a  numeration  of  rows  without  much  meaning. 
We  can  transform  the  arrangement  of  our  data,  redistributing  the  indexes  and  columns 
for  better  manipulation  of  our  data,  which  normally  leads  to  better  performance.  We 
can  rearrange  our  data  using  the  pivot_table  function.  Here,  we  can  specify 
which  columns  will  be  the  new  indexes,  the  new  values,  and  the  new  columns. 

For  example,  imagine  that  we  want  to  transform  our  DataFrame  to  a  spreadsheet¬ 
like  structure  with  the  country  names  as  the  index,  while  the  columns  will  be  the 
years  starting  from  2006  and  the  values  will  be  the  previous  Value  column.  To  do 
this,  first  we  need  to  filter  out  the  data  and  then  pivot  it  in  this  way: 

f - \ 

f i 1 t ered_da t a  =  edu [ edu [ "TIME " ]  >  2005] 

pivedu  =  pd . p i vo t _t abl e ( f i 1 t er ed_da t a ,  values  =  'Value', 

index  =  [ ' GEO ' ] , 

columns  =  [ ' TIME ' ] ) 

pivedu . head  (  ) 

V _ / 


TIME 

2 

006 

2 

007 

2 

008 

2009 

2 

010 

20 

11 

GEO 

Austria 

5 

.40 

5 

.33 

5 

.47 

5. 

98 

5 

.91 

5. 

80 

Belgium 

5 

.  98 

6 

.  00 

6 

.43 

6. 

57 

6 

.58 

6. 

55 

Bulgaria 

4 

.  04 

3 

.88 

4 

.44 

4. 

58 

4 

.10 

3  . 

82 

Cyprus 

7 

.  02 

6 

.95 

7 

.45 

7  . 

98 

7 

.92 

7  . 

87 

Czech  Republic 

4 

.42 

4 

.  05 

3 

.92 

4. 

36 

4 

.25 

4. 

51 

Now  we  can  use  the  new  index  to  select  specific  rows  by  label,  using  the  ix 
operator: 


\ 

pivedu .  ix [  [  '  Spain  '  ,  ' Portugal  ' 

]  ,  [2  006  ,  2011]  ] 

v 

_ y 

TIME 

2006 

2011 

GEO 

Spain 

4.26 

4.82 

Portugal 

5.07 

5.27 

Pivot  also  offers  the  option  of  providing  an  argument  aggr_function  that 
allows  us  to  perform  an  aggregation  function  between  the  values  if  there  is  more 
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In  [27]  : 


Out [ 27 ] 


than  one  value  for  the  given  row  and  column  after  the  transformation.  As  usual,  you 
can  design  any  custom  function  you  want,  just  giving  its  name  or  using  a  A-function. 


2.6.9  Ranking  Data 

Another  useful  visualization  feature  is  to  rank  data.  For  example,  we  would  like  to 
know  how  each  country  is  ranked  by  year.  To  see  this,  we  will  use  the  pandas  rank 
function.  But  first,  we  need  to  clean  up  our  previous  pivoted  table  a  bit  so  that  it  only 
has  real  countries  with  real  data.  To  do  this,  first  we  drop  the  Euro  area  entries  and 
shorten  the  Germany  name  entry,  using  the  rename  function  and  then  we  drop  all 
the  rows  containing  any  NaN,  using  the  dropna  function. 

Now  we  can  perform  the  ranking  using  the  rank  function.  Note  here  that  the 
parameter  ascending=False  makes  the  ranking  go  from  the  highest  values  to 
the  lowest  values.  The  Pandas  rank  function  supports  different  tie-breaking  methods, 
specified  with  the  method  parameter.  In  our  case,  we  use  the  first  method,  in 
which  ranks  are  assigned  in  the  order  they  appear  in  the  array,  avoiding  gaps  between 
ranking. 

f - 'i 

pivedu  =  pivedu . drop  (  [ 


'  Euro 

area 

(13 

c oun t r i e s )  '  , 

'  Euro 

area 

(15 

c oun t r i e s )  '  , 

'  Euro 

area 

(17 

c oun t r i e s )  '  , 

'  Euro 

area 

(18 

c oun t r i e s )  '  , 

' European 

Union 

(25 

countries)  ' 

' European 

Union 

(27 

countries)  ' 

' European 

Union 

(28 

countries)  ' 

]  , 

axi s  =  0 ) 

pivedu  =  pivedu  .  rename  ( index  =  {'Germany  (until  19  9  0  former  territory 
of  the  FRG) ' :  ' Germany ' } ) 

pivedu  =  pivedu . dropna  (  ) 

pivedu . rank ( ascending  =  False,  method  =  ' first'). head () 

V _ / 


TIME 

2006 

2007 

2008 

2009 

2010 

2011 

GEO 

Austria 

10 

7 

11 

7 

8 

8 

Belgium 

5 

4 

3 

4 

5 

5 

Bulgaria 

21 

21 

20 

20 

22 

21 

Cyprus 

2 

2 

2 

2 

2 

3 

Czech  Republic 

19 

20 

21 

21 

20 

18 

If  we  want  to  make  a  global  ranking  taking  into  account  all  the  years,  we  can 
sum  up  all  the  columns  and  rank  the  result.  Then  we  can  sort  the  resulting  values  to 
retrieve  the  top  five  countries  for  the  last  6  years,  in  this  way: 


N 

totalSum  =  pivedu . sum ( axi s  =  1) 

t o t a 1  Sum . r ank ( a s c endi ng  =  False,  method  = 

' dense ' ) 

. sort_values ( ) . head ( ) 

_ y 

In  [28]  : 
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Out  [28]  :  GE0 

Denmark  1 
Cyprus  2 
Finland  3 
Malta  4 
Belgium  5 


dtype :  float64 

Notice  that  the  method  keyword  argument  in  the  in  the  rank  function  specifies 
how  items  that  compare  equals  receive  ranking.  In  the  case  of  dense,  items  that 
compare  equals  receive  the  same  ranking  number,  and  the  next  not  equal  item  receives 
the  immediately  following  ranking  number. 


2.6.10  Plotting 

Pandas  DataFrames  and  Series  can  be  plotted  using  the  plot  function,  which  uses 
the  library  for  graphics  Matplotlib.  For  example,  if  we  want  to  plot  the  accumulated 
values  for  each  country  over  the  last  6  years,  we  can  take  the  Series  obtained  in  the 
previous  example  and  plot  it  directly  by  calling  the  plot  function  as  shown  in  the 
next  cell: 


\ 

totalSum  =  pivedu . sum ( axi s  =  1) 

. sort_values (ascending 

=  False ) 

t o t a 1  Sum . p 1 o t  ( kind  =  'bar',  style  =  'b', 

alpha  =  0.4, 

title  =  "Total  Values  for 

Country  "  ) 

V 

_ y 

Out  [29]: 


GEO 


Note  that  if  we  want  the  bars  ordered  from  the  highest  to  the  lowest  value,  we 
need  to  sort  the  values  in  the  Series  first.  The  parameter  kind  used  in  the  plot 
function  defines  which  kind  of  graphic  will  be  used.  In  our  case,  a  bar  graph.  The 
parameter  style  refers  to  the  style  properties  of  the  graphic,  in  our  case,  the  color 


2.6  Get  Started  with  Python  for  Data  Scientists 


27 


In  [30]  : 


Out [30] 


of  bars  is  set  to  b  (blue).  The  alpha  channel  can  be  modified  adding  a  keyword 
parameter  alpha  with  a  percentage,  producing  a  more  translucent  plot.  Finally, 
using  the  title  keyword  the  name  of  the  graphic  can  be  set. 

It  is  also  possible  to  plot  a  DataFrame  directly.  In  this  case,  each  column  is  treated 
as  a  separated  Series.  For  example,  instead  of  printing  the  accumulated  value  over 

the  years,  we  can  plot  the  value  for  each  year. 

/  \ 

my_colors  =  ['b',  'r',  'g',  'y',  'm';  'c'] 

ax  =  pivedu . plot ( kind  =  ' barh ' , 

stacked  =  True, 
color  =  my_colors) 

ax . legend ( loc  =  'center  left',  bbox_to_anchor  =  ( 1 ,  . 5 ) ) 
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In  this  case,  we  have  used  a  horizontal  bar  graph  (kind=  '  barh ' )  stacking  all  the 
years  in  the  same  country  bar.  This  can  be  done  by  setting  the  parameter  stacked 
to  True.  The  number  of  default  colors  in  a  plot  is  only  5,  thus  if  you  have  more 
than  5  Series  to  show,  you  need  to  specify  more  colors  or  otherwise  the  same  set  of 
colors  will  be  used  again.  We  can  set  a  new  set  of  colors  using  the  keyword  color 
with  a  list  of  colors.  Basic  colors  have  a  single-character  code  assigned  to  each,  for 
example,  “b”  is  for  blue,  “r”  for  red,  “g”  for  green,  “y”  for  yellow,  “m”  for  magenta, 
and  “c”  for  cyan.  When  several  Series  are  shown  in  a  plot,  a  legend  is  created  for 
identifying  each  one.  The  name  for  each  Series  is  the  name  of  the  column  in  the 
DataFrame.  By  default,  the  legend  goes  inside  the  plot  area.  If  we  want  to  change 
this,  we  can  use  the  legend  function  of  the  axis  object  (this  is  the  object  returned 
when  the  plot  function  is  called).  By  using  the  loc  keyword,  we  can  set  the  relative 
position  of  the  legend  with  respect  to  the  plot.  It  can  be  a  combination  of  right  or 
left  and  upper,  lower,  or  center.  With  bbox_to_anchor  we  can  set  an  absolute 
position  with  respect  to  the  plot,  allowing  us  to  put  the  legend  outside  the  graph. 
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2.7  Conclusions 

This  chapter  has  been  a  brief  introduction  to  the  most  essential  elements  of  a  pro¬ 
gramming  environment  for  data  scientists.  The  tutorial  followed  in  this  chapter  is 
just  a  starting  point  for  more  advanced  projects  and  techniques.  As  we  will  see  in 
the  following  chapters,  Python  and  its  ecosystem  is  a  very  empowering  choice  for 
developing  data  science  projects. 

Acknowledgements  This  chapter  was  co-written  by  Eloi  Puertas  and  Francesc  Danti. 


Descriptive  Statistics 


3.1  Introduction 

Descriptive  statistics  helps  to  simplify  large  amounts  of  data  in  a  sensible  way. 
In  contrast  to  inferential  statistics,  which  will  be  introduced  in  a  later  chapter,  in 
descriptive  statistics  we  do  not  draw  conclusions  beyond  the  data  we  are  analyzing; 
neither  do  we  reach  any  conclusions  regarding  hypotheses  we  may  make.  We  do  not 
try  to  infer  characteristics  of  the  “population”  (see  below)  of  the  data,  but  claim  to 
present  quantitative  descriptions  of  it  in  a  manageable  form.  It  is  simply  a  way  to 
describe  the  data. 

Statistics,  and  in  particular  descriptive  statistics,  is  based  on  two  main  concepts: 

•  a  population  is  a  collection  of  objects,  items  (“units”)  about  which  information  is 
sought; 

•  a  sample  is  a  part  of  the  population  that  is  observed. 

Descriptive  statistics  applies  the  concepts,  measures,  and  terms  that  are  used  to 
describe  the  basic  features  of  the  samples  in  a  study.  These  procedures  are  essential 
to  provide  summaries  about  the  samples  as  an  approximation  of  the  population. 
Together  with  simple  graphics,  they  form  the  basis  of  every  quantitative  analysis  of 
data.  In  order  to  describe  the  sample  data  and  to  be  able  to  infer  any  conclusion,  we 
should  go  through  several  steps: 

1.  Data  preparation :  Given  a  specific  example,  we  need  to  prepare  the  data  for 
generating  statistically  valid  descriptions. 

2.  Descriptive  statistics :  This  generates  different  statistics  to  describe  and  summa¬ 
rize  the  data  concisely  and  evaluate  different  ways  to  visualize  them. 
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3.2  Data  Preparation 

One  of  the  first  tasks  when  analyzing  data  is  to  collect  and  prepare  the  data  in  a  format 

appropriate  for  analysis  of  the  samples.  The  most  common  steps  for  data  preparation 

involve  the  following  operations. 

1 .  Obtaining  the  data:  Data  can  be  read  directly  from  a  file  or  they  might  be  obtained 
by  scraping  the  web. 

2.  Parsing  the  data:  The  right  parsing  procedure  depends  on  what  format  the  data 
are  in:  plain  text,  fixed  columns,  CSV,  XML,  HTML,  etc. 

3.  Cleaning  the  data:  Survey  responses  and  other  data  files  are  almost  always  in¬ 
complete.  Sometimes,  there  are  multiple  codes  for  things  such  as,  not  asked,  did 
not  know,  and  declined  to  answer.  And  there  are  almost  always  errors.  A  simple 
strategy  is  to  remove  or  ignore  incomplete  records. 

4.  Building  data  structures :  Once  you  read  the  data,  it  is  necessary  to  store  them  in 
a  data  structure  that  lends  itself  to  the  analysis  we  are  interested  in.  If  the  data  fit 
into  the  memory,  building  a  data  structure  is  usually  the  way  to  go.  If  not,  usually 
a  database  is  built,  which  is  an  out-of-memory  data  structure.  Most  databases 
provide  a  mapping  from  keys  to  values,  so  they  serve  as  dictionaries. 


3.2.1  The  Adult  Example 

Let  us  consider  a  public  database  called  the  ‘Adult”  dataset,  hosted  on  the  UCI’s 
Machine  Learning  Repository.  It  contains  approximately  32,000  observations  con¬ 
cerning  different  financial  parameters  related  to  the  US  population:  age,  sex,  marital 
(marital  status  of  the  individual),  country,  income  (Boolean  variable:  whether  the  per¬ 
son  makes  more  than  $50,000  per  annum),  education  (the  highest  level  of  education 
achieved  by  the  individual),  occupation,  capital  gain,  etc. 

We  will  show  that  we  can  explore  the  data  by  asking  questions  like:  “Are  men 
more  likely  to  become  high-income  professionals  than  women,  i.e.,  to  receive  an 
income  of  over  $50,000  per  annum?” 


1  https://archive.ics.uci.edu/ml/datasets/Adult. 
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First,  let  us  read  the  data: 


In  [2] : 

f  \ 

print  data  [1:2] 

^ _ y 

Out [ 2 ]  : 

[[50,  ' Self-emp-not-inc ' ,  83311,  'Bachelors',  13, 

'Married-civ-spouse',  'Exec-managerial',  'Husband',  'White', 
'Male',  0,  0,  13,  'United-States',  '<=50^T']] 

One  of  the  easiest  ways  to  manage  data  in  Python  is  by  using  the  DataFrame 
structure,  defined  in  the  Pandas  library,  which  is  a  two-dimensional,  size-mutable, 
potentially  heterogeneous  tabular  data  structure  with  labeled  axes: 

In  [3]  : 

f  \ 

df  =  pd. DataFrame (data) 
df . columns  =  [ 

'  age '  ,  '  type_emp 1 oy er '  ,  '  fnlwgt  '  , 

' education ' ,  ' educ a t i on_num ',  'marital', 

'occupation' , '  relationship' ,  'race' , 

' sex' ,  ' capital_gain ' ,  ' capital_loss ' , 

' hr_per_week '  ,  ' country  '  ,  '  income  ' 

] 

V _ y 

The  command  shape  gives  exactly  the  number  of  data  samples  (in  rows,  in  this 
case)  and  features  (in  columns): 

In  [4]  : 

r  x 

df . shape 

v  J 

Out [ 4 ]  : 


(32561,  15) 
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In  [  5  ]  : 


Out [ 5 ]  : 


In  [6]  : 


In  [7] : 


Thus,  we  can  see  that  our  dataset  contains  32,561  data  records  with  15  features 

each.  Let  us  count  the  number  of  items  per  country: 

/  \ 

counts  =  df  .  groupby (  ' country '  )  .  size  (  ) 
print  counts . head ( ) 

V _ / 


country 
?  583 

Cambodia  19 
Vietnam  67 
Yugoslavia  16 

The  first  row  shows  the  number  of  samples  with  unknown  country,  followed  by 
the  number  of  samples  corresponding  to  the  first  countries  in  the  dataset. 

Let  us  split  people  according  to  their  gender  into  two  groups:  men  and  women. 

/  \ 

ml  =  df  [  ( df .  sex  =  =  'Male')] 

V  _ / 

If  we  focus  on  high-income  professionals  separated  by  sex,  we  can  do: 

/ - \ 

mil  =  df  [  (df  .  sex  =  =  'Male')  &  (  df  .  inc  ome  =  =  '  >  5  0  K  \  n  '  ) 

] 

fm  =  df  [  ( df .  sex  =  =  'Female')] 

fml  =  df  [  (df  .  sex  =  =  'Female')  &  (  df  .  inc  ome  =  =  '  >  5  0  K  \  n 
'  )  ] 

V  _ / 


3.3  Exploratory  Data  Analysis 

The  data  that  come  from  performing  a  particular  measurement  on  all  the  subjects 
in  a  sample  represent  our  observations  for  a  single  characteristic  like  country, 
age,  education,  etc.  These  measurements  and  categories  represent  a  sample 
distribution  of  the  variable,  which  in  turn  approximately  represents  the  population 
distribution  of  the  variable.  One  of  the  main  goals  of  exploratory  data  analysis  is 
to  visualize  and  summarize  the  sample  distribution,  thereby  allowing  us  to  make 
tentative  assumptions  about  the  population  distribution. 


3.3.1  Summarizing  the  Data 

The  data  in  general  can  be  categorical  or  quantitative.  For  categorical  data,  a  simple 
tabulation  of  the  frequency  of  each  category  is  the  best  non-graphical  exploration 
for  data  analysis.  For  example,  we  can  ask  ourselves  what  is  the  proportion  of  high- 
income  professionals  in  our  database: 
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In  [8] : 


Out [ 8 ]  : 


df  1  = 

df  [  (df  .  income=  =  '  >50K\n  '  )  ] 

\ 

print 

'The  rate  of  people  with  high  income  is:  '  , 
int  (len(dfl)  /float  (len(df)  )  *100)  ,  '  %  .  ' 

print 

'The  rate  of  men  with  high  income  is:  '  , 
int  (len(mll)  /float  (len(ml)  )  *100)  ,  '%.  ' 

print 

'The  rate  of  women  with  high  income  is:  '  , 

v 

int  (len(fml)  /float  (len(fm)  )  *100)  ,  '%.  ' 

_ y 

The  rate  of  people  with  high  income  is:  24  %. 

The  rate  of  men  with  high  income  is:  30  %. 

The  rate  of  women  with  high  income  is:  10  %. 

Given  a  quantitative  variable,  exploratory  data  analysis  is  a  way  to  make  prelim¬ 
inary  assessments  about  the  population  distribution  of  the  variable  using  the  data  of 
the  observed  samples.  The  characteristics  of  the  population  distribution  of  a  quanti¬ 
tative  variable  are  its  mean ,  deviation ,  histograms ,  outliers ,  etc.  Our  observed  data 
represent  just  a  finite  set  of  samples  of  an  often  infinite  number  of  possible  samples. 
The  characteristics  of  our  randomly  observed  samples  are  interesting  only  to  the 
degree  that  they  represent  the  population  of  the  data  they  came  from. 


3.3. 1.1  Mean 


One  of  the  first  measurements  we  use  to  have  a  look  at  the  data  is  to  obtain  sample 
statistics  from  the  data,  such  as  the  sample  mean  [1].  Given  a  sample  of  n  values, 
{xi } ,  i  =  1 ,  ,n,  the  mean ,  /i,  is  the  sum  of  the  values  divided  by  the  number  of 
values,2  in  other  words: 


(3.1) 


The  terms  mean  and  average  are  often  used  interchangeably.  In  fact,  the  main 
distinction  between  them  is  that  the  mean  of  a  sample  is  the  summary  statistic  com¬ 
puted  by  Eq.  (3.1),  while  an  average  is  not  strictly  defined  and  could  be  one  of  many 
summary  statistics  that  can  be  chosen  to  describe  the  central  tendency  of  a  sample. 

In  our  case,  we  can  consider  what  the  average  age  of  men  and  women  samples  in 
our  dataset  would  be  in  terms  of  their  mean: 


2We  will  use  the  following  notation:  X  is  a  random  variable,  x  is  a  column  vector,  xT  (the  transpose 
of  x)  is  a  row  vector,  X  is  a  matrix,  and  Xi  is  the  i-th  element  of  a  dataset. 
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In  [9]  : 


Out [ 9 ]  : 


The  average  age  of  men  is:  39.4335474989 
The  average  age  of  women  is:  36.8582304336 
The  average  age  of  high-income  men  is:  44.6257880516 
The  average  age  of  high-income  women  is:  42.1255301103 

This  difference  in  the  sample  means  can  be  considered  initial  evidence  that  there 
are  differences  between  men  and  women  with  high  income! 

Comment:  Later,  we  will  work  with  both  concepts:  the  population  mean  and  the 
sample  mean.  We  should  not  confuse  them!  The  first  is  the  mean  of  samples  taken 
from  the  population;  the  second,  the  mean  of  the  whole  population. 


3.3.1 .2  Sample  Variance 

The  mean  is  not  usually  a  sufficient  descriptor  of  the  data.  We  can  go  further  by 
knowing  two  numbers:  mean  and  variance.  The  variance  a2  describes  the  spread  of 
the  data  and  it  is  defined  as  follows: 

<7  =  -  y>,  -  m)2.  (3.2) 

i 

The  term  (v;  —  /i)  is  called  the  deviation  from  the  mean,  so  the  variance  is  the  mean 
squared  deviation.  The  square  root  of  the  variance,  a,  is  called  the  standard  deviation. 
We  consider  the  standard  deviation,  because  the  variance  is  hard  to  interpret  (e.g.,  if 
the  units  are  grams,  the  variance  is  in  grams  squared). 

Let  us  compute  the  mean  and  the  variance  of  hours  per  week  men  and  women  in 
our  dataset  work: 


In  [10] : 
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Outfit)]:  Statistics  of  age  for  men:  mu:  39.4335474989  var :  178.773751745 
std:  13.3706301925 

Statistics  of  age  for  women:  mu:  36.8582304336  var: 

196.383706395  std:  14.0136970994 

We  can  see  that  the  mean  number  of  hours  worked  per  week  by  women  is  signif¬ 
icantly  lesser  than  that  worked  by  men,  but  with  much  higher  variance  and  standard 
deviation. 


3.3.1 .3  Sample  Median 

The  mean  of  the  samples  is  a  good  descriptor,  but  it  has  an  important  drawback:  what 
will  happen  if  in  the  sample  set  there  is  an  error  with  a  value  very  different  from  the 
rest?  For  example,  considering  hours  worked  per  week,  it  would  normally  be  in  a 
range  between  20  and  80;  but  what  would  happen  if  by  mistake  there  was  a  value 
of  1000?  An  item  of  data  that  is  significantly  different  from  the  rest  of  the  data  is 
called  an  outlier.  In  this  case,  the  mean,  /i,  will  be  drastically  changed  towards  the 
outlier.  One  solution  to  this  drawback  is  offered  by  the  statistical  median ,  /112,  which 
is  an  order  statistic  giving  the  middle  value  of  a  sample.  In  this  case,  all  the  values 
are  ordered  by  their  magnitude  and  the  median  is  defined  as  the  value  that  is  in  the 
middle  of  the  ordered  list.  Hence,  it  is  a  value  that  is  much  more  robust  in  the  face 
of  outliers. 


Let  us  see,  the  median  age  of  working  men  and  women  in  our  dataset  and  the 
median  age  of  high-income  men  and  women: 


Out [11] :  Median  age  per  men  and  women:  38.0  35.0 

Median  age  per  men  and  women  with  high-income:  44.0  41.0 

As  expected,  the  median  age  of  high-income  people  is  higher  than  the  whole  set 
of  working  people,  although  the  difference  between  men  and  women  in  both  sets  is 
the  same. 


3.3.1 .4  Quantiles  and  Percentiles 

Sometimes  we  are  interested  in  observing  how  sample  data  are  distributed  in  general. 
In  this  case,  we  can  order  the  samples  {xt },  then  find  the  xp  so  that  it  divides  the  data 
into  two  parts,  where: 
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Fig.  3.1  Histogram  of  the  age  of  working  men  (left)  and  women  (right) 


•  a  fraction  p  of  the  data  values  is  less  than  or  equal  to  xp  and 

•  the  remaining  fraction  (1  —  p)  is  greater  than  xp. 

That  value,  xp ,  is  the  p-th  quantile,  or  the  100  x  p-th  percentile.  For  example,  a 
5-number  summary  is  defined  by  the  values  xmin,  Q\ ,  <2  2 ,  Q3 ,  xmax,  where  Q\  is 
the  25  x  p-th  percentile,  Q2  is  the  50  x  p-th  percentile  and  Q3  is  the  75  x  p-th 
percentile. 


3.3.2  Data  Distributions 

Summarizing  data  by  just  looking  at  their  mean,  median,  and  variance  can  be  danger¬ 
ous:  very  different  data  can  be  described  by  the  same  statistics.  The  best  thing  to  do 
is  to  validate  the  data  by  inspecting  them.  We  can  have  a  look  at  the  data  distribution, 
which  describes  how  often  each  value  appears  (i.e.,  what  is  its  frequency). 

The  most  common  representation  of  a  distribution  is  a  histogram ,  which  is  a  graph 
that  shows  the  frequency  of  each  value.  Let  us  show  the  age  of  working  men  and 
women  separately. 


The  output  can  be  seen  in  Fig.  3.1.  If  we  want  to  compare  the  histograms,  we  can 
plot  them  overlapping  in  the  same  graphic  as  follows: 
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Fig.  3.2  Histogram  of  the  age  of  working  men  (in  ochre)  and  women  (in  violet )  (left).  Histogram  of 
the  age  of  working  men  (in  ochre),  women  (in  blue),  and  their  intersection  (in  violet)  after  samples 
normalization  (right) 


The  output  can  be  seen  in  Fig.  3.2  (left).  Note  that  we  are  visualizing  the  absolute 
values  of  the  number  of  people  in  our  dataset  according  to  their  age  (the  abscissa  of 
the  histogram).  As  a  side  effect,  we  can  see  that  there  are  many  more  men  in  these 
conditions  than  women. 


We  can  normalize  the  frequencies  of  the  histogram  by  dividing/normalizing  by 
n,  the  number  of  samples.  The  normalized  histogram  is  called  the  Probability  Mass 
Function  (PMF). 


\ 

f m_age .hist (normed 

=  1, 

his  t  typ  e  = 

' s  t ep f i 1 1 ed  7  , 

alpha  = 

.  5  , 

bins  =  20) 

ml_age .hist (normed 

=  1, 

his  t  typ  e  = 

'  s  t  ep  f i 1 1 ed  '  , 

alpha  = 

•  5  , 

bins  =  10, 

color  = 

sns 

. desaturate  ( 

" indianred"  , 

.  75  ) 

) 

c 

_ 

This  outputs  Fig.  3.2  (right),  where  we  can  observe  a  comparable  range  of  indi¬ 
viduals  (men  and  women). 

The  Cumulative  Distribution  Function  (CDF),  or  just  distribution  function, 
describes  the  probability  that  a  real-valued  random  variable  X  with  a  given  proba¬ 
bility  distribution  will  be  found  to  have  a  value  less  than  or  equal  to  v.  Let  us  show 
the  CDF  of  age  distribution  for  both  men  and  women. 
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In  [1 


Fig.  3.3  The  CDF  of  the  age 
of  working  male  (in  blue ) 
and  female  (in  red)  samples 


N 

ml_age . hist (normed  =  1, 

histtype  =  'step', 

cumulat i ve 

=  True,  linewidth  = 

3.5, 

bins  =  20) 

fm_age . hist ( normed  =  1, 

his  t  typ  e  =  ' step  '  , 

cumulat i ve 

=  True,  linewidth  = 

3.5, 

bins  =  20, 

color  =  sns 

. desaturate  ( " indianred"  , 

.  75)  ) 

_ 

J 

The  output  can  be  seen  in  Fig.  3.3,  which  illustrates  the  CDF  of  the  age  distributions 
for  both  men  and  women. 


3.3.3  Outlier  Treatment 

As  mentioned  before,  outliers  are  data  samples  with  a  value  that  is  far  from  the  central 
tendency.  Different  rules  can  be  defined  to  detect  outliers,  as  follows: 

•  Computing  samples  that  are  far  from  the  median. 

•  Computing  samples  whose  values  exceed  the  mean  by  2  or  3  standard  deviations. 

For  example,  in  our  case,  we  are  interested  in  the  age  statistics  of  men  versus 
women  with  high  incomes  and  we  can  see  that  in  our  dataset,  the  minimum  age  is  17 
years  and  the  maximum  is  90  years.  We  can  consider  that  some  of  these  samples  are 
due  to  errors  or  are  not  representable.  Applying  the  domain  knowledge,  we  focus  on 
the  median  age  (37,  in  our  case)  up  to  72  and  down  to  22  years  old,  and  we  consider 
the  rest  as  outliers. 
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In  [17] : 


In  [18] : 


Out [ 18 ]  : 


In  [19] : 


( - \ 

df2  =  df  .  drop  (df  .  index  [ 

(df.  income  =  =  '  >50K\n '  )  & 

(df  [  '  age  '  ]  >  df  [  '  age  '  ]  .  median  (  )  +  35)  & 

(df  [  '  age  '  ]  >  df  [  '  age  '  ]  .  median  (  )  -15) 

]  ) 

mll_age  =  mil  ['age'] 

fml_age  =  fml  [  '  age  '  ] 

ml2_age  =  mll_age . drop (mll_age .  index  [ 

(mll_age  >  df ['age']. median ()  +  35)  & 

(mll_age  >  df ['age']. median ()  -  15) 

]  ) 

fm2_age  =  fml_age . drop ( fml_age .  index  [ 

(fml_age  >  df ['age']. median ()  +  35)  & 

(fml_age  >  df ['age']. median ()  -  15) 

]  ) 

v _ / 


We  can  check  how  the  mean  and  the  median  changed  once  the  data  were  cleaned: 


r 


mu2ml  =  ml 2_age . mean ( ) 
std2ml  =  ml2_age . s td ( ) 
md2ml  =  ml2_age . median ( ) 
mu2fm  =  fm2_age  . mean  (  ) 
std2fm  =  f m2_age . s td ( ) 
md2fm  =  fm2_age . median ( ) 


print  "Men  statistics:" 
print  "Mean: " ,  mu 2 ml ,  " Std : 

print  "Median: " ,  md2ml 
print  "Min:",  ml2_age . min ( ) 


s  td2ml 

Max :  "  ,  ml2_age  . max  (  ) 


print 

print 

print 

print 


"Women  statistics : " 

"  Mean  :  "  ,  mu2  fm  ,  "  Std  : 

" Median :  "  ,  md2  fm 
"Min: " ,  fm2_age .min () 


s  t d2  f m 

Max :  "  ,  fm2_age  .  max  (  ) 


Men  statistics:  Mean:  44.3179821239  Std:  10.0197498572  Median: 
44.0  Min:  19  Max:  72 

Women  statistics:  Mean:  41.877028181  Std:  10.0364418073  Median: 
41.0  Min:  19  Max:  72 


Let  us  visualize  how  many  outliers  are  removed  from  the  whole  data  by: 


p 1 t . f i gur e ( f i g s i z e  = 

(13.4,  5)  ) 

\ 

df . age  [  ( df .  inc ome  =  = 

'  >50K\n  '  )  ] 

. plot ( alpha  = 

.25,  color  = 

'blue  '  ) 

df  2  . age  [  ( df  2  .  income 

=  =  '  >50K\n  '  )  ] 

. plot ( alpha  = 

v 

.45,  color  = 

' red  '  ) 

_ y 
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Fig.  3.4  The  red  shows  the  cleaned  data  without  the  considered  outliers  (in  blue) 


Figure  3.4  shows  the  outliers  in  blue  and  the  rest  of  the  data  in  red.  Visually,  we 
can  confirm  that  we  removed  mainly  outliers  from  the  dataset. 

Next  we  can  see  that  by  removing  the  outliers,  the  difference  between  the  popula¬ 
tions  (men  and  women)  actually  decreased.  In  our  case,  there  were  more  outliers  in 
men  than  women.  If  the  difference  in  the  mean  values  before  removing  the  outliers 
is  2.5,  after  removing  them  it  slightly  decreased  to  2.44: 


Out[20]:  The  mean  difference  with  outliers  is:  2.58. 

The  mean  difference  without  outliers  is:  2.44. 


Let  us  observe  the  difference  of  men  and  women  incomes  in  the  cleaned  subset 
with  some  more  details. 


The  results  are  shown  in  Fig.  3.5.  One  can  see  that  the  differences  between  male 
and  female  values  are  slightly  negative  before  age  42  and  positive  after  it.  Hence, 
women  tend  to  be  promoted  (receive  more  than  50  K)  earlier  than  men. 


3.3  Exploratory  Data  Analysis 


41 


Differences  in  promoting  men  vs.  women 

0  008 


20  30  40  50  60  70 

Age 

Fig.  3.5  Differences  in  high-income  earner  men  versus  women  as  a  function  of  age 


3.3.4  Measuring  Asymmetry:  Skewness  and  Pearson's  Median 
Skewness  Coefficient 


For  univariate  data,  the  formula  for  skewness  is  a  statistic  that  measures  the  asym¬ 
metry  of  the  set  of  n  data  samples,  .v, : 

_  1  £;  (*/  -  M3 
n 

where  /x  is  the  mean,  a  is  the  standard  deviation,  and  n  is  the  number  of  data  points. 

Negative  deviation  indicates  that  the  distribution  “skews  left”  (it  extends  further 
to  the  left  than  to  the  right).  One  can  easily  see  that  the  skewness  for  a  normal 
distribution  is  zero,  and  any  symmetric  data  must  have  a  skewness  of  zero.  Note 
that  skewness  can  be  affected  by  outliers!  A  simpler  alternative  is  to  look  at  the 
relationship  between  the  mean  /x  and  the  median  (i\2. 


(3.3) 
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Out[22]:  Skewness  of  the  male  population  =  0.266444383843 

Skewness  of  the  female  population  =  0.386333524913 

That  is,  the  female  population  is  more  skewed  than  the  male,  probably  since  men 
could  be  most  prone  to  retire  later  than  women. 

The  Pearson’s  median  skewness  coefficient  is  a  more  robust  alternative  to  the 
skewness  coefficient  and  is  defined  as  follows: 

9P  =  3(/i  -  h\i)ct. 


There  are  many  other  definitions  for  skewness  that  will  not  be  discussed  here.  In 
our  case,  if  we  check  the  Pearson’s  skewness  coefficient  for  both  men  and  women, 
we  can  see  that  the  difference  between  them  actually  increases: 


Out[23]:  Pearson's  coefficient  of  the  male  population  =  9.55830402221 

Pearson's  coefficient  of  the  female  population  =  26.4067269073 


3.3.4.1  Discussions 

After  exploring  the  data,  we  obtained  some  apparent  effects  that  seem  to  support 
our  initial  assumptions.  For  example,  the  mean  age  for  men  in  our  dataset  is  39.4 
years;  while  for  women,  is  36.8  years.  When  analyzing  the  high-income  salaries,  the 
mean  age  for  men  increased  to  44.6  years;  while  for  women,  increased  to  42.1  years. 
When  the  data  were  cleaned  from  outliers,  we  obtained  mean  age  for  high-income 
men:  44.3,  and  for  women:  41.8.  Moreover,  histograms  and  other  statistics  show  the 
skewness  of  the  data  and  the  fact  that  women  used  to  be  promoted  a  little  bit  earlier 
than  men,  in  general. 


3.3.5  Continuous  Distribution 

The  distributions  we  have  considered  up  to  now  are  based  on  empirical  observations 
and  thus  are  called  empirical  distributions.  As  an  alternative,  we  may  be  interested 
in  considering  distributions  that  are  defined  by  a  continuous  function  and  are  called 
continuous  distributions  [2].  Remember  that  we  defined  the  PMF,  fx(x),  of  a  discrete 
random  variable  X  as  fx(x)  =  P(X  =  x)  for  all  x.  In  the  case  of  a  continuous 
random  variable  X,  we  speak  of  the  Probability  Density  Function  (PDF),  which 
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Fig.  3.6  Exponential  CDF  (left)  and  PDF  (right)  with  A  =  3.00 


is  defined  as  Fx(x )  where  this  satisfies:  Fx(x)  =  [*  fx(t)6t  for  all  v.  There  are 
many  continuous  distributions;  here,  we  will  consider  the  most  common  ones:  the 
exponential  and  the  normal  distributions. 


3.3.5. 1  The  Exponential  Distribution 

Exponential  distributions  are  well  known  since  they  describe  the  inter-arrival  time 
between  events.  When  the  events  are  equally  likely  to  occur  at  any  time,  the  distri¬ 
bution  of  the  inter-arrival  time  tends  to  an  exponential  distribution.  The  CDF  and  the 
PDF  of  the  exponential  distribution  are  defined  by  the  following  equations: 


CDF(x )  =  l-e 


—Xx 


PDF(x )  =  \e 


—Xx 


The  parameter  A  defines  the  shape  of  the  distribution.  An  example  is  given  in 
Fig.  3.6.  It  is  easy  to  show  that  the  mean  of  the  distribution  is  j,  the  variance  is 

and  the  median  is  . 

Note  that  for  a  small  number  of  samples,  it  is  difficult  to  see  that  the  exact  empirical 
distribution  fits  a  continuous  distribution.  The  best  way  to  observe  this  match  is  to 
generate  samples  from  the  continuous  distribution  and  see  if  these  samples  match 
the  data.  As  an  exercise,  you  can  consider  the  birthdays  of  a  large  enough  group  of 
people,  sorting  them  and  computing  the  inter-arrival  time  in  days.  If  you  plot  the 
CDF  of  the  inter-arrival  times,  you  will  observe  the  exponential  distribution. 

There  are  a  lot  of  real-world  events  that  can  be  described  with  this  distribution, 
including  the  time  until  a  radioactive  particle  decays;  the  time  it  takes  before  your 
next  telephone  call;  and  the  time  until  default  (on  payment  to  company  debt  holders) 
in  reduced-form  credit  risk  modeling.  The  random  variable  X  of  the  lifetime  of  some 
batteries  is  associated  with  a  probability  density  function  of  the  form:  PDF(x )  = 
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Fig.  3.7  Normal  PDF  with  fi  =  6  and  a  =  2 


33.5.2  The  Normal  Distribution 

The  normal  distribution ,  also  called  the  Gaussian  distribution ,  is  the  most  common 
since  it  represents  many  real  phenomena:  economic,  natural,  social,  and  others.  Some 
well-known  examples  of  real  phenomena  with  a  normal  distribution  are  as  follows: 

•  The  size  of  living  tissue  (length,  height,  weight). 

•  The  length  of  inert  appendages  (hair,  nails,  teeth)  of  biological  specimens. 

•  Different  physiological  measurements  (e.g.,  blood  pressure),  etc. 


The  normal  CDF  has  no  closed-form  expression  and  its  most  common  represen¬ 
tation  is  the  PDF: 


1  _ 

PDF(v)  =  e  2a2 


The  parameter  a  defines  the  shape  of  the  distribution.  An  example  of  the  PDF  of 
a  normal  distribution  with  /i  =  6  and  a  =  2  is  given  in  Fig.  3.7. 


3.3.6  Kernel  Density 

In  many  real  problems,  we  may  not  be  interested  in  the  parameters  of  a  particular 
distribution  of  data,  but  just  a  continuous  representation  of  the  data.  In  this  case, 
we  should  estimate  the  distribution  non-parametrically  (i.e.,  making  no  assumptions 
about  the  form  of  the  underlying  distribution)  using  kernel  density  estimation.  Let  us 
imagine  that  we  have  a  set  of  data  measurements  without  knowing  their  distribution 
and  we  need  to  estimate  the  continuous  representation  of  their  distribution.  In  this 
case,  we  can  consider  a  Gaussian  kernel  to  generate  the  density  around  the  data.  Let 
us  consider  a  set  of  random  data  generated  by  a  bimodal  normal  distribution.  If  we 
consider  a  Gaussian  kernel  around  the  data,  the  sum  of  those  kernels  can  give  us 
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Fig.  3.8  Summed  kernel  functions  around  a  random  set  of  points  (left)  and  the  kernel  density 
estimate  with  the  optimal  bandwidth  (right)  for  our  dataset.  Random  data  shown  in  blue,  kernel 
shown  in  black  and  summed  function  shown  in  red 


a  continuous  function  that  when  normalized  would  approximate  the  density  of  the 
distribution: 


Figure  3.8  (left)  shows  the  result  of  the  construction  of  the  continuous  function 
from  the  kernel  summarization. 

In  fact,  the  library  SciPy  implements  a  Gaussian  kernel  density  estimation  that 
automatically  chooses  the  appropriate  bandwidth  parameter  for  the  kernel.  Thus,  the 
final  construction  of  the  density  estimate  will  be  obtained  by: 


3  http :  // w  w  w.  scipy.  org . 
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In  [25] 


Figure  3.8  (right)  shows  the  result  of  the  kernel  density  estimate  for  our  example. 


3.4  Estimation 

An  important  aspect  when  working  with  statistical  data  is  being  able  to  use  estimates 
to  approximate  the  values  of  unknown  parameters  of  the  dataset.  In  this  section,  we 
will  review  different  kinds  of  estimators  (estimated  mean,  variance,  standard  score, 
etc.). 

3.4.1  Sample  and  Estimated  Mean,  Variance  and  Standard  Scores 

In  continuation,  we  will  deal  with  point  estimators  that  are  single  numerical  estimates 
of  parameters  of  a  population. 


3.4.1. 1  Mean 

Let  us  assume  that  we  know  that  our  data  are  coming  from  a  normal  distribution  and 
the  random  samples  drawn  are  as  follows: 

{0.33,  -1.76,2.34,0.56,0.89}. 

The  question  is  can  we  guess  the  mean  /x  of  the  distribution?  One  approximation  is 
given  by  the  sample  mean,  v.  This  process  is  called  estimation  and  the  statistic  (e.g., 
the  sample  mean)  is  called  an  estimator.  In  our  case,  the  sample  mean  is  0.472,  and  it 
seems  a  logical  choice  to  represent  the  mean  of  the  distribution.  It  is  not  so  evident  if 
we  add  a  sample  with  a  value  of  —465.  In  this  case,  the  sample  mean  will  be  —77. 1 1 , 
which  does  not  look  like  the  mean  of  the  distribution.  The  reason  is  due  to  the  fact 
that  the  last  value  seems  to  be  an  outlier  compared  to  the  rest  of  the  sample.  In  order 
to  avoid  this  effect,  we  can  try  first  to  remove  outliers  and  then  to  estimate  the  mean; 
or  we  can  use  the  sample  median  as  an  estimator  of  the  mean  of  the  distribution. 
If  there  are  no  outliers,  the  sample  mean  x  minimizes  the  following  mean  squared 
error: 


MSE  =  -  V(i  -  iij1, 
n  L ' 


where  n  is  the  number  of  times  we  estimate  the  mean. 
Let  us  compute  the  MSE  of  a  set  of  random  data: 
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Out [26]:  MSE:  0.00019879541147 


3.4.1 .2  Variance 

If  we  ask  ourselves  what  is  the  variance,  a2,  of  the  distribution  of  X ,  analogously  we 
can  use  the  sample  variance  as  an  estimator.  Let  us  denote  by  a2  the  sample  variance 
estimator: 


-  y>  -x)2. 

n  L — ' 


For  large  samples,  this  estimator  works  well,  but  for  a  small  number  of  samples 
it  is  biased.  In  those  cases,  a  better  estimator  is  given  by: 


1 


-  x)2. 


3.4.1 .3  Standard  Score 

In  many  real  problems,  when  we  want  to  compare  data,  or  estimate  their  correlations 
or  some  other  kind  of  relations,  we  must  avoid  data  that  come  in  different  units. 
For  example,  weight  can  come  in  kilograms  or  grams.  Even  data  that  come  in  the 
same  units  can  still  belong  to  different  distributions.  We  need  to  normalize  them  to 
standard  scores.  Given  a  dataset  as  a  series  of  values,  {v/},  we  convert  the  data  to 
standard  scores  by  subtracting  the  mean  and  dividing  them  by  the  standard  deviation: 

( Xi  -  M) 

Zi  =  - • 

( 7 

Note  that  this  measure  is  dimensionless  and  its  distribution  has  a  mean  of  0  and 
variance  of  1.  It  inherits  the  “shape”  of  the  dataset:  if  X  is  normally  distributed,  so 
is  Z;  if  X  is  skewed,  so  is  Z. 

3.4.2  Covariance,  and  Pearson's  and  Spearman's  Rank  Correlation 

Variables  of  data  can  express  relations.  For  example,  countries  that  tend  to  invest  in 
research  also  tend  to  invest  more  in  education  and  health.  This  kind  of  relationship 
is  captured  by  the  covariance. 
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World  Oil  Product! on(T)  Economic  growth(T) 


Fig.  3.9  Positive  correlation  between  economic  growth  and  stock  market  returns  worldwide  (left). 
Negative  correlation  between  the  world  oil  production  and  gasoline  prices  worldwide  (right) 


3.4.2. 1  Covariance 

When  two  variables  share  the  same  tendency,  we  speak  about  covariance.  Let  us 
consider  two  series,  {v/}  and  {y/}.  Let  us  center  the  data  with  respect  to  their  mean: 
dxi  =  Xi  —  fix  and  dyi  =  yi  —  /ip.  It  is  easy  to  show  that  when  {v/}  and  {y/}  vary 
together,  their  deviations  tend  to  have  the  same  sign.  The  covariance  is  defined  as 
the  mean  of  the  following  products: 


Cov(X ,  Y) 


1 

n 


n 

Y.  dxj  dyt , 
i= 1 


where  n  is  the  length  of  both  sets.  Still,  the  covariance  itself  is  hard  to  interpret. 


3.4.2.2  Correlation  and  the  Pearson's  Correlation 

If  we  normalize  the  data  with  respect  to  their  deviation,  that  leads  to  the  standard 
scores;  and  then  multiplying  them,  we  get: 

Xi  —  dx  yi  —  dY 

Pi  —  • 

ax  cry 

The  mean  of  this  product  is  p  =  Y^i= l  Pi-  Equivalently,  we  can  rewrite  p  in 
terms  of  the  covariance,  and  thus  obtain  the  Pearson’s  correlation : 

Cov(X ,  Y) 

P=  - • 

axcry 

Note  that  the  Pearson’s  correlation  is  always  between  —1  and  +1,  where  the 
magnitude  depends  on  the  degree  of  correlation.  If  the  Pearson’s  correlation  is  1  (or 
—  1),  it  means  that  the  variables  are  perfectly  correlated  (positively  or  negatively) 
(see  Fig.  3.9).  This  means  that  one  variable  can  predict  the  other  very  well.  However, 
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Fig.  3.10  Anscombe  configurations 


having  p  =  0,  does  not  necessarily  mean  that  the  variables  are  not  correlated!  Pear¬ 
son’s  correlation  captures  correlations  of  first  order,  but  not  nonlinear  correlations. 
Moreover,  it  does  not  work  well  in  the  presence  of  outliers. 


3.4.2.3  Spearman's  Rank  Correlation 

The  Spearman’s  rank  correlation  comes  as  a  solution  to  the  robustness  problem  of 
Pearson’s  correlation  when  the  data  contain  outliers.  The  main  idea  is  to  use  the 
ranks  of  the  sorted  sample  data,  instead  of  the  values  themselves.  For  example,  in 
the  list  [4,  3,  7,  5],  the  rank  of  4  is  2,  since  it  will  appear  second  in  the  ordered  list 
([3,  4,  5,  7]).  Spearman’s  correlation  computes  the  correlation  between  the  ranks 
of  the  data.  For  example,  considering  the  data:  X  =  [10,  20,  30,  40,  1000],  and 
Y  =  [—70,  —1000,  —50,  —10,  —20],  where  we  have  an  outlier  in  each  one  set.  If 
we  compute  the  ranks,  they  are  [1.0,  2.0,  3.0, 4.0,  5.0]  and  [2.0, 1.0,  3.0,  5.0, 4.0].  As 
value  of  the  Pearson’s  coefficient,  we  get  0.28,  which  does  not  show  much  correlation 
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between  the  sets.  However,  the  Spearman’s  rank  coefficient,  capturing  the  correlation 
between  the  ranks,  gives  as  a  final  value  of  0.80,  confirming  the  correlation  between 
the  sets.  As  an  exercise,  you  can  compute  the  Pearson’s  and  the  Spearman’s  rank 
correlations  for  the  different  Anscombe  configurations  given  in  Fig.  3.10.  Observe  if 
linear  and  nonlinear  correlations  can  be  captured  by  the  Pearson’s  and  the  Spearman’s 
rank  correlations. 


3.5  Conclusions 

In  this  chapter,  we  have  familiarized  ourselves  with  the  basic  concepts  and  procedures 
of  descriptive  statistics  to  explore  a  dataset.  As  we  have  seen,  it  helps  us  to  understand 
the  experiment  or  a  dataset  in  detail  and  allows  us  to  put  the  data  in  perspective.  We 
introduced  the  central  measures  of  tendency  such  as  the  sample  mean  and  median; 
and  measures  of  variability  such  as  the  variance  and  standard  deviation.  We  have  also 
discussed  how  these  measures  can  be  affected  by  outliers.  In  order  to  go  deeper  into 
visualizing  the  dataset,  we  have  introduced  histograms,  quantiles,  and  percentiles. 

In  many  situations,  when  the  values  are  continuous  variables,  it  is  convenient  to 
use  continuous  distributions;  the  most  common  of  which  are  the  normal  and  the 
exponential  distributions.  The  advantage  of  most  continuous  distributions  is  that 
we  can  have  an  explicit  expression  for  their  PDF  and  CDF,  as  well  as  the  mean 
and  variance  in  terms  of  a  closed  formula.  Also,  we  learned  how,  by  using  the 
kernel  density,  we  can  obtain  a  continuous  representation  of  the  sample  distribution. 
Finally,  we  discussed  how  to  estimate  the  correlation  and  the  covariance  of  datasets, 
where  two  of  the  most  popular  measures  are  the  Pearson’s  and  the  Spearman’s  rank 
correlations,  which  are  affected  in  different  ways  by  the  outliers  of  the  dataset. 
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4.1  Introduction 

There  is  not  only  one  way  to  address  the  problem  of  statistical  inference.  In  fact, 
there  are  two  main  approaches  to  statistical  inference:  the  frequentist  and  Bayesian 
approaches.  Their  differences  are  subtle  but  fundamental: 

•  In  the  case  of  the  frequentist  approach ,  the  main  assumption  is  that  there  is  a 
population,  which  can  be  represented  by  several  parameters,  from  which  we  can 
obtain  numerous  random  samples.  Population  parameters  are  fixed  but  they  are 
not  accessible  to  the  observer.  The  only  way  to  derive  information  about  these 
parameters  is  to  take  a  sample  of  the  population,  to  compute  the  parameters  of  the 
sample,  and  to  use  statistical  inference  techniques  to  make  probable  propositions 
regarding  population  parameters. 

•  The  Bayesian  approach  is  based  on  a  consideration  that  data  are  fixed,  not  the  result 
of  a  repeatable  sampling  process,  but  parameters  describing  data  can  be  described 
probabilistically.  To  this  end,  Bayesian  inference  methods  focus  on  producing 
parameter  distributions  that  represent  all  the  knowledge  we  can  extract  from  the 
sample  and  from  prior  information  about  the  problem. 

A  deep  understanding  of  the  differences  between  these  approaches  is  far  beyond 
the  scope  of  this  chapter,  but  there  are  many  interesting  references  that  will  enable 
you  to  learn  about  it  [1].  What  is  really  important  is  to  realize  that  the  approaches 
are  based  on  different  assumptions  which  determine  the  validity  of  their  inferences. 
The  assumptions  are  related  in  the  first  case  to  a  sampling  process;  and  to  a  statistical 
model  in  the  second  case.  Correct  inference  requires  these  assumptions  to  be  correct. 
The  fulfillment  of  this  requirement  is  not  part  of  the  method,  but  it  is  the  responsibility 
of  the  data  scientist. 

In  this  chapter,  to  keep  things  simple,  we  will  only  deal  with  the  first  approach, 
but  we  suggest  the  reader  also  explores  the  second  approach  as  it  is  well  worth  it! 
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4.2  Statistical  lnference:The  Frequentist  Approach 

As  we  have  said,  the  ultimate  objective  of  statistical  inference,  if  we  adopt  the  fre¬ 
quentist  approach,  is  to  produce  probable  propositions  concerning  population  param¬ 
eters  from  analysis  of  a  sample.  The  most  important  classes  of  propositions  are  as 
follows: 

•  Propositions  about  point  estimates.  A  point  estimate  is  a  particular  value  that  best 
approximates  some  parameter  of  interest.  For  example,  the  mean  or  the  variance 
of  the  sample. 

•  Propositions  about  confidence  intervals  or  set  estimates.  A  confidence  interval  is 
a  range  of  values  that  best  represents  some  parameter  of  interest. 

•  Propositions  about  the  acceptance  or  rejection  of  a  hypothesis. 

In  all  these  cases,  the  production  of  propositions  is  based  on  a  simple  assumption: 
we  can  estimate  the  probability  that  the  result  represented  by  the  proposition  has 
been  caused  by  chance.  The  estimation  of  this  probability  by  sound  methods  is  one 
of  the  main  topics  of  statistics. 

The  development  of  traditional  statistics  was  limited  by  the  scarcity  of  computa¬ 
tional  resources.  In  fact,  the  only  computational  resources  were  mechanical  devices 
and  human  computers,  teams  of  people  devoted  to  undertaking  long  and  tedious 
calculations.  Given  these  conditions,  the  main  results  of  classical  statistics  are  theo¬ 
retical  approximations,  based  on  idealized  models  and  assumptions,  to  measure  the 
effect  of  chance  on  the  statistic  of  interest.  Thus,  concepts  such  as  the  Central  Limit 
Theorem ,  the  empirical  sample  distribution  or  the  t-test  are  central  to  understanding 
this  approach. 

The  development  of  modern  computers  has  opened  an  alternative  strategy  for 
measuring  chance  that  is  based  on  simulation;  producing  computationally  inten¬ 
sive  methods  including  resampling  methods  (such  as  bootstrapping),  Markov  chain 
Monte  Carlo  methods,  etc.  The  most  interesting  characteristic  of  these  methods  is 
that  they  allow  us  to  treat  more  realistic  models. 


4.3  Measuring  the  Variability  in  Estimates 

Estimates  produced  by  descriptive  statistics  are  not  equal  to  the  truth  but  they  are 
better  as  more  data  become  available.  So,  it  makes  sense  to  use  them  as  central 
elements  of  our  propositions  and  to  measure  its  variability  with  respect  to  the  sample 


size. 
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In  [1]  : 


Out  [  1  ]  : 


4.3.1  Point  Estimates 

Let  us  consider  a  dataset  of  accidents  in  Barcelona  in  2013.  This  dataset  can  be 
downloaded  from  the  OpenDataBCN  website,1  Barcelona  City  Hall’s  open  data 
service.  Each  register  in  the  dataset  represents  an  accident  via  a  series  of  features: 
weekday,  hour,  address,  number  of  dead  and  injured  people,  etc.  This  dataset  will 
represent  our  population:  the  set  of  all  reported  traffic  accidents  in  Barcelona  during 
2013. 


4.3.1 .1  Sampling  Distribution  of  Point  Estimates 

Let  us  suppose  that  we  are  interested  in  describing  the  daily  number  of  traffic  acci¬ 
dents  in  the  streets  of  Barcelona  in  2013.  If  we  have  access  to  the  population ,  the 
computation  of  this  parameter  is  a  simple  operation:  the  total  number  of  accidents 
divided  by  365. 


data  =  pd . read_csv (" files / ch04 / ACCIDENTS_GU_BCN_ 

2  013  . csv  "  ) 

\ 

data  [  '  Date  '  ]  =  data  [u  '  Dia  de  mes  '  ]  .  apply  (  lambda 

x  :  s  t  r  (  x  )  ) 

+  '  -  '  + 

data  [u  'Mes  de  any  '  ]  .  app ly  (  1  ambda 

x  :  s  t  r  (  x  )  ) 

data [ ' Date ' ]  =  pd . to_datetime ( data [' Date '] ) 

accidents  =  data . groupby ( [ ' Date ' ] ) . size ( ) 

V 

print  ac c i den t s  . mean  (  ) 

_ y 

Mean:  25.9095 

But  now,  for  illustrative  purposes,  let  us  suppose  that  we  only  have  access  to  a 
limited  part  of  the  data  (the  sample ):  the  number  of  accidents  during  some  days  of 
2013.  Can  we  still  give  an  approximation  of  the  population  mean? 

The  most  intuitive  way  to  go  about  providing  such  a  mean  is  simply  to  take  the 
sample  mean .  The  sample  mean  is  a  point  estimate  of  the  population  mean.  If  we  can 
only  choose  one  value  to  estimate  the  population  mean,  then  this  is  our  best  guess. 

The  problem  we  face  is  that  estimates  generally  vary  from  one  sample  to  another, 
and  this  sampling  variation  suggests  our  estimate  may  be  close,  but  it  will  not  be 
exactly  equal  to  our  parameter  of  interest.  How  can  we  measure  this  variability? 

In  our  example,  because  we  have  access  to  the  population,  we  can  empirically  build 
the  sampling  distribution  of  the  sample  mean2  for  a  given  number  of  observations. 
Then,  we  can  use  the  sampling  distribution  to  compute  a  measure  of  the  variability. 

In  Fig.  4. 1,  we  can  see  the  empirical  sample  distribution  of  the  mean  for  s  =  10.000 
samples  with  n  =  200  observations  from  our  dataset.  This  empirical  distribution  has 
been  built  in  the  following  way: 


1  http://opendata.bcn.cat/. 

2  Suppose  that  we  draw  all  possible  samples  of  a  given  size  from  a  given  population.  Suppose  further 
that  we  compute  the  mean  for  each  sample.  The  probability  distribution  of  this  statistic  is  called  the 
mean  sampling  distribution. 
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Fig.  4.1  Empirical  distribution  of  the  sample  mean.  In  red ,  the  mean  value  of  this  distribution 


1.  Draw  s  (a  large  number)  independent  samples  {x1,  . . . ,  xs]  from  the  population 
where  each  element  x-7  is  composed  of 

2.  Evaluate  the  sample  mean  /i-7  =  ^  Y%=  l  x!  °f  eac^  sample. 

3.  Estimate  the  sampling  distribution  of  jl  by  the  empirical  distribution  of  the  sample 
replications. 


In  [2]  : 


c - \ 

#  populat ion 

df  =  accidents . to_frame ( ) 

N_t  e  s  t  =  10000 

e 1 emen t  s  =  2  0  0 

#  mean  array  of  samples 

means  =  [0]  *  N_test 

#  sample  generation 

for  i  in  range (N_test ) : 

rows  =  np . random . cho i c e ( df . i ndex . va lue s ,  elements) 
sampled_df  =  df .  ix  [  rows ] 
means [ i ]  =  sampled_df . mean ( ) 

V _ / 


In  general,  given  a  point  estimate  from  a  sample  of  size  n ,  we  define  its  sampling 
distribution  as  the  distribution  of  the  point  estimate  based  on  samples  of  size  n 
from  its  population.  This  definition  is  valid  for  point  estimates  of  other  population 
parameters,  such  as  the  population  median  or  population  standard  deviation,  but  we 
will  focus  on  the  analysis  of  the  sample  mean. 

The  sampling  distribution  of  an  estimate  plays  an  important  role  in  understanding 
the  real  meaning  of  propositions  concerning  point  estimates.  It  is  very  useful  to  think 
of  a  particular  point  estimate  as  being  drawn  from  such  a  distribution. 


4.3.1 .2  The  Traditional  Approach 

In  real  problems,  we  do  not  have  access  to  the  real  population  and  so  estimation 
of  the  sampling  distribution  of  the  estimate  from  the  empirical  distribution  of  the 
sample  replications  is  not  an  option.  But  this  problem  can  be  solved  by  making  use 
of  some  theoretical  results  from  traditional  statistics. 


4.3  Measuring  the  Variability  in  Estimates 
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In  [3]  : 


Out  [  3  ]  : 


It  can  be  mathematically  shown  that  given  n  independent  observations  {xi  };=i,. ,}W 
of  a  population  with  a  standard  deviation  ax ,  the  standard  deviation  of  the  sample 
mean  ox ,  or  standard  error ,  can  be  approximated  by  this  formula: 


SE  = 


The  demonstration  of  this  result  is  based  on  the  Central  Limit  Theorem:  an  old 
theorem  with  a  history  that  starts  in  1810  when  Laplace  released  his  first  paper  on  it. 

This  formula  uses  the  standard  deviation  of  the  population  ox ,  which  is  not  known, 
but  it  can  be  shown  that  if  it  is  substituted  by  its  empirical  estimate  dx ,  the  estimation 
is  sufficiently  good  if  n  >  30  and  the  population  distribution  is  not  skewed.  This 
allows  us  to  estimate  the  standard  error  of  the  sample  mean  even  if  we  do  not  have 
access  to  the  population. 

So,  how  can  we  give  a  measure  of  the  variability  of  the  sample  mean?  The  answer 
is  simple:  by  giving  the  empirical  standard  error  of  the  mean  distribution. 

f  \ 

rows  =  np . random . choice  ( df  .  index . values  ,  2  00) 

sampled_df  =  df . ix [ rows ] 

e  s  t_s i gma_mean  =  s  amp 1 ed_d  f  .  std  ()  /math,  sqrt  (200) 

print  'Direct  estimation  of  SE  from  one  sample  of 
200  elements : ' ,  e s t_s i gma_me an [0] 
print  'Estimation  of  the  SE  by  simulating  10000  samples  of 
200  elements:',  np . array  (means )  .  std  (  ) 

v _ y 


Direct  estimation  of  SE  from  one  sample  of  200  elements:  0.6536 
Estimation  of  the  SE  by  simulating  10000  samples  of  200 
elements:  0.6362 

Unlike  the  case  of  the  sample  mean,  there  is  no  formula  for  the  standard  error  of 
other  interesting  sample  estimates,  such  as  the  median. 


4.3.1 .3  The  Computationally  Intensive  Approach 

Let  us  consider  from  now  that  our  full  dataset  is  a  sample  from  a  hypothetical 
population  (this  is  the  most  common  situation  when  analyzing  real  data!). 

A  modern  alternative  to  the  traditional  approach  to  statistical  inference  is  the 
bootstrapping  method  [2].  In  the  bootstrap,  we  draw  n  observations  with  replacement 
from  the  original  data  to  create  a  bootstrap  sample  or  resample.  Then,  we  can  calculate 
the  mean  for  this  resample.  By  repeating  this  process  a  large  number  of  times,  we 
can  build  a  good  approximation  of  the  mean  sampling  distribution  (see  Fig.  4.2). 
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Fig.  4.2  Mean  sampling  distribution  by  bootstrapping.  In  red ,  the  mean  value  of  this  distribution 


In  [4]  : 


/ - \ 

def  meanBootstrap (X,  numberb ) : 
x  =  [0]*numberb 
for  i  in  range (numberb) : 
samp 1 e  =  [  X  [  j  ] 

for  j 

in  np . random .  randint  ( len  (X)  ,  size  =  len  (X)  ) 

] 

x[i]  =  np . mean ( s amp 1 e ) 

return  x 

m  =  meanBootstrap (accidents ,  10000) 

print  "Mean  estimate:",  np .mean ( m ) 

v _ y 


Out [4]:  Mean  estimate:  25.9094 

The  basic  idea  of  the  bootstrapping  method  is  that  the  observed  sample  contains 
sufficient  information  about  the  underlying  distribution.  So,  the  information  we  can 
extract  from  resampling  the  sample  is  a  good  approximation  of  what  can  be  expected 
from  resampling  the  population. 

The  bootstrapping  method  can  be  applied  to  other  simple  estimates  such  as  the 
median  or  the  variance  and  also  to  more  complex  operations  such  as  estimates  of 
censored  data. 


4.3.2  Confidence  Intervals 

A  point  estimate  (9,  such  as  the  sample  mean,  provides  a  single  plausible  value  for 
a  parameter.  However,  as  we  have  seen,  a  point  estimate  is  rarely  perfect;  usually 
there  is  some  error  in  the  estimate.  That  is  why  we  have  suggested  using  the  standard 
error  as  a  measure  of  its  variability. 

Instead  of  that,  a  next  logical  step  would  be  to  provide  a  plausible  range  of  values 
for  the  parameter.  A  plausible  range  of  values  for  the  sample  parameter  is  called  a 
confidence  interval. 


3 


Censoring  is  a  condition  in  which  the  value  of  observation  is  only  partially  known. 
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In  [  5  ]  : 


Out  [  5  ]  : 


We  will  base  the  definition  of  confidence  interval  on  two  ideas: 

1 .  Our  point  estimate  is  the  most  plausible  value  of  the  parameter,  so  it  makes  sense 
to  build  the  confidence  interval  around  the  point  estimate. 

2.  The  plausibility  of  a  range  of  values  can  be  defined  from  the  sampling  distribution 
of  the  estimate. 

For  the  case  of  the  mean,  the  Central  Limit  Theorem  states  that  its  sampling 
distribution  is  normal: 

Theorem  4.1  Given  a  population  with  a  finite  mean  p  and  a  finite  non-zero  variance 
a2,  the  sampling  distribution  of  the  mean  approaches  a  normal  distribution  with  a 
mean  of  p  and  a  variance  of  o2 In  as  n,  the  sample  size,  increases. 

In  this  case,  and  in  order  to  define  an  interval,  we  can  make  use  of  a  well-known 
result  from  probability  that  applies  to  normal  distributions:  roughly  95%  of  the  time 
our  estimate  will  be  within  1.96  standard  errors  of  the  true  mean  of  the  distribution. 
If  the  interval  spreads  out  1.96  standard  errors  from  a  normally  distributed  point 
estimate,  intuitively  we  can  say  that  we  are  roughly  95%  confident  that  we  have 
captured  the  true  parameter. 

Cl  =  [&  -  1.96  x  SE,  0  +  1.96  x  SE] 

/  \ 

m  =  ac c i den t s  . mean  (  ) 

se  =  a c  c  i  den  t  s  .  s  t  d  (  )  /  ma  th  .  s qr  t  (  1  en  (  ac  c  i  den  t  s  )  ) 
ci  =  [m  -  se  *  1 . 9  6  ,  m  +  se  *  1 . 9  6 ] 
print  "Confidence  interval:",  ci 

V _ / 


Confidence  interval:  [24.975,  26.8440] 

Suppose  we  want  to  consider  confidence  intervals  where  the  confidence  level  is 
somewhat  higher  than  95%:  perhaps  we  would  like  a  confidence  level  of  99%.  To 
create  a  99%  confidence  interval,  change  1 .96  in  the  95%  confidence  interval  formula 
to  be  2.58  (it  can  be  shown  that  99%  of  the  time  a  normal  random  variable  will  be 
within  2.58  standard  deviations  of  the  mean). 

In  general,  if  the  point  estimate  follows  the  normal  model  with  standard  error  SE , 
then  a  confidence  interval  for  the  population  parameter  is 

&  ±z  x  SE 

where  z  corresponds  to  the  confidence  level  selected: 


Confidence  Level 

90% 

95% 

99% 

99.9% 

z  Value 

1.65 

1.96 

2.58 

3.291 

This  is  how  we  would  compute  a  95%  confidence  interval  of  the  sample  mean 
using  bootstrapping: 
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1.  Repeat  the  following  steps  for  a  large  number,  s,  of  times: 

a.  Draw  n  observations  with  replacement  from  the  original  data  to  create  a 
bootstrap  sample  or  resample. 

b.  Calculate  the  mean  for  the  resample. 

2.  Calculate  the  mean  of  your  s  values  of  the  sample  statistic.  This  process  gives 
you  a  “bootstrapped”  estimate  of  the  sample  statistic. 

3.  Calculate  the  standard  deviation  of  your  s  values  of  the  sample  statistic.  This 
process  gives  you  a  “bootstrapped”  estimate  of  the  SE  of  the  sample  statistic. 

4.  Obtain  the  2.5th  and  97.5th  percentiles  of  your  s  values  of  the  sample  statistic. 


In  [  6  ]  : 


f  \ 

m  =  meanBootstrap (accidents ,  10000) 

sample_mean  =  np.mean(m) 
sample_se  =  np.std(m) 

print  "Mean  estimate : " ,  sample_mean 
print  " SE  of  the  estimate:",  sample_se 

c i  =  [ np . percentile (m,  2.5) ,  np . percentile (m,  97.5)] 

print  "Confidence  interval:",  ci 

\ _ / 


Out [6]:  Mean  estimate:  25.9039 

SE  of  the  estimate:  0.4705 
Confidence  interval:  [24.9834,  26.8219] 


4.3.2. 1  But  What  Does  "95%  Confident"  Mean? 

The  real  meaning  of  “confidence”  is  not  evident  and  it  must  be  understood  from  the 
point  of  view  of  the  generating  process. 

Suppose  we  took  many  (infinite)  samples  from  a  population  and  built  a  95% 
confidence  interval  from  each  sample.  Then  about  95%  of  those  intervals  would 
contain  the  actual  parameter.  In  Fig.  4.3  we  show  how  many  confidence  intervals 
computed  from  100  different  samples  of  100  elements  from  our  dataset  contain  the 
real  population  mean.  If  this  simulation  could  be  done  with  infinite  different  samples, 
5%  of  those  intervals  would  not  contain  the  true  mean. 

So,  when  faced  with  a  sample,  the  correct  interpretation  of  a  confidence  interval 
is  as  follows: 

In  95%  of  the  cases,  when  I  compute  the  95%  confidence  interval  from  this  sample,  the  true 

mean  of  the  population  will  fall  within  the  interval  defined  by  these  bounds:  ±1.96  x  SE. 

We  cannot  say  either  that  our  specific  sample  contains  the  true  parameter  or  that 
the  interval  has  a  95%  chance  of  containing  the  true  parameter.  That  interpretation 
would  not  be  correct  under  the  assumptions  of  traditional  statistics. 


4.4  Hypothesis  Testing 
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4.4  Hypothesis  Testing 

Giving  a  measure  of  the  variability  of  our  estimates  is  one  way  of  producing  a 
statistical  proposition  about  the  population,  but  not  the  only  one.  R.A.  Fisher  (1890- 
1962)  proposed  an  alternative,  known  as  hypothesis  testing ,  that  is  based  on  the 
concept  of  statistical  significance. 

Let  us  suppose  that  a  deeper  analysis  of  traffic  accidents  in  Barcelona  results  in  a 
difference  between  2010  and  2013.  Of  course,  the  difference  could  be  caused  only 
by  chance,  because  of  the  variability  of  both  estimates.  But  it  could  also  be  the  case 
that  traffic  conditions  were  very  different  in  Barcelona  during  the  two  periods  and, 
because  of  that,  data  from  the  two  periods  can  be  considered  as  belonging  to  two 
different  populations.  Then,  the  relevant  question  is:  Are  the  observed  effects  real  or 
not? 

Technically,  the  question  is  usually  translated  to:  Were  the  observed  effects  statis¬ 
tically  significant ? 

The  process  of  determining  the  statistical  significance  of  an  effect  is  called  hypoth¬ 
esis  testing. 

This  process  starts  by  simplifying  the  options  into  two  competing  hypotheses: 

•  Ho’.  The  mean  number  of  daily  traffic  accidents  is  the  same  in  2010  and  2013 
(there  is  only  one  population,  one  true  mean,  and  2010  and  2013  are  just  different 
samples  from  the  same  population). 

•  Ha  ’.  The  mean  number  of  daily  traffic  accidents  in  2010  and  2013  is  different 
(2010  and  2013  are  two  samples  from  two  different  populations). 
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Sample  (with  100  observations). 

Fig.  4.3  This  graph  shows  100  sample  means  {green  points)  and  its  corresponding  confidence 
intervals,  computed  from  100  different  samples  of  100  elements  from  our  dataset.  It  can  be  observed 
that  a  few  of  them  (those  in  red )  do  not  contain  the  mean  of  the  population  {black  horizontal  line ) 


Confidence  interval  for  the  samples  mean  estimate 
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In  [7]  : 


Out  [  7  ]  : 


We  call  Ho  the  null  hypothesis  and  it  represents  a  skeptical  point  of  view:  the 
effect  we  have  observed  is  due  to  chance  (due  to  the  specific  sample  bias).  Ha  is  the 
alternative  hypothesis  and  it  represents  the  other  point  of  view:  the  effect  is  real. 

The  general  rule  of  frequentist  hypothesis  testing:  we  will  not  discard  Ho  (and 
hence  we  will  not  consider  Ha)  unless  the  observed  effect  is  implausible  under  Ho . 


4.4.1  Testing  Hypotheses  Using  Confidence  Intervals 

We  can  use  the  concept  represented  by  confidence  intervals  to  measure  the  plausi¬ 
bility  of  a  hypothesis. 

We  can  illustrate  the  evaluation  of  the  hypothesis  setup  by  comparing  the  mean 
rate  of  traffic  accidents  in  Barcelona  during  2010  and  2013: 

c - \ 

data  =  pd . read_csv ("  files / ch0  4 / ACCIDENTS_GU_BCN_2  010  . csv"  , 

encoding= ' latin-1 ' ) 

#  Create  a  new  column  which  is  the  date 

data  [  '  Date  '  ]  =  data  [  '  Dia  de  mes  '  ]  .  apply  (  lambda  x:  str  (x)  ) 

+  '  -  '  + 

data  [  '  Mes  de  any  '  ]  .  app  ly  (  1  ambda  x  :  s  t  r  (  x  )  ) 
data2  =  data  [  ' Date '  ] 

counts2010  =  data [' Date ']. value_counts ( ) 
print  '2010:  Mean',  c oun t s 2 0 1 0 . mean ( ) 

data  =  pd .  read_csv ("  files / ch0  4 / ACCIDENTS_GU_BCN_2  013  . csv"  , 

encoding= ' latin-1 ' ) 

#  Create  a  new  column  which  is  the  date 

data  ['Date']  =  data  [  '  Dia  de  mes  '  ]  .  apply  (  1  ambda  x  :  s  t  r  (  x  )  ) 

+  '  -  '  + 

data  [  '  Mes  de  any  '  ]  .  app  ly  (  1  ambda  x  :  s  t  r  (  x  )  ) 
data2  =  data  [  ' Date '  ] 

counts2013  =  data [' Date ']. value_counts  (  ) 
print  '2013:  Mean',  counts2013 . mean ( ) 

v _ 


2010:  Mean  24.8109 
2013:  Mean  25.9095 


This  estimate  suggests  that  in  2013  the  mean  rate  of  traffic  accidents  in  Barcelona 
was  higher  than  it  was  in  2010.  But  is  this  effect  statistically  significant? 

Based  on  our  sample,  the  95%  confidence  interval  for  the  mean  rate  of  traffic 
accidents  in  Barcelona  during  2013  can  be  calculated  as  follows: 

f  \ 

n  =  1 en ( c oun t s 2 0 1 3  ) 

mean  =  c oun t s 2 0 1 3  . mean  (  ) 
s  =  count s2 0 1 3  .  s td  (  ) 

ci  =  [mean  -  s  *  1  .  9  6  /  np  .sqrt(n),  mean  +  s  *  1  .  9  6  /  np  .  sqr  t  (  n  )  ] 
print  '2010  accident  rate  estimate:',  counts2010 .mean ( ) 
print  '2013  accident  rate  estimate:',  counts2013 .mean ( ) 
print  'CI  for  2013:  '  , ci 

v _ y 


In  [8]  : 
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Out  [  8  ]  : 


2010  accident  rate  estimate:  24.8109 
2013  accident  rate  estimate:  25.9095 
Cl  for  2013:  [24.9751,  26.8440] 

Because  the  2010  accident  rate  estimate  does  not  fall  in  the  range  of  plausible 
values  of  2013,  we  say  the  alternative  hypothesis  cannot  be  discarded.  That  is,  it 
cannot  be  ruled  out  that  in  2013  the  mean  rate  of  traffic  accidents  in  Barcelona  was 
higher  than  in  2010. 

Interpreting  Cl  Tests 

Hypothesis  testing  is  built  around  rejecting  or  failing  to  reject  the  null  hypothesis. 
That  is,  we  do  not  reject  Ho  unless  we  have  strong  evidence  against  it.  But  what 
precisely  does  strong  evidence  mean?  As  a  general  rule  of  thumb,  for  those  cases 
where  the  null  hypothesis  is  actually  true,  we  do  not  want  to  incorrectly  reject  Ho 
more  than  5%  of  the  time.  This  corresponds  to  a  significance  level  of  a  =  0.05.  In 
this  case,  the  correct  interpretation  of  our  test  is  as  follows: 

If  we  use  a  95%  confidence  interval  to  test  a  problem  where  the  null  hypothesis  is  true,  we 
will  make  an  error  whenever  the  point  estimate  is  at  least  1.96  standard  errors  away  from  the 
population  parameter.  This  happens  about  5%  of  the  time  (2.5%  in  each  tail). 


4.4.2  Testing  Hypotheses  Using  /^-Values 

A  more  advanced  notion  of  statistical  significance  was  developed  by  R.A.  Fisher  in 
the  1920s  when  he  was  looking  for  a  test  to  decide  whether  variation  in  crop  yields 
was  due  to  some  specific  intervention  or  merely  random  factors  beyond  experimental 
control. 

Fisher  first  assumed  that  fertilizer  caused  no  difference  (null  hypothesis )  and  then 
calculated  P,  the  probability  that  an  observed  yield  in  a  fertilized  field  would  occur 
if  fertilizer  had  no  real  effect.  This  probability  is  called  the  p -value. 

The  p-value  is  the  probability  of  observing  data  at  least  as  favorable  to  the  alter¬ 
native  hypothesis  as  our  current  dataset,  if  the  null  hypothesis  is  true.  We  typically 
use  a  summary  statistic  of  the  data  to  help  compute  the  p -value  and  evaluate  the 
hypotheses. 

Usually,  if  P  is  less  than  0.05  (the  chance  of  a  fluke  is  less  than  5%)  the  result  is 
declared  statistically  significant. 

It  must  be  pointed  out  that  this  choice  is  rather  arbitrary  and  should  not  be  taken 
as  a  scientific  truth. 

The  goal  of  classical  hypothesis  testing  is  to  answer  the  question,  “ Given  a  sample 
and  an  apparent  effect,  what  is  the  probability  of  seeing  such  an  effect  by  chance  ?” 
Here  is  how  we  answer  that  question: 

•  The  first  step  is  to  quantify  the  size  of  the  apparent  effect  by  choosing  a  test  statistic. 
In  our  case,  the  apparent  effect  is  a  difference  in  accident  rates,  so  a  natural  choice 
for  the  test  statistic  is  the  difference  in  means  between  the  two  periods. 
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In  [9]  : 


Out  [  9  ]  : 


In  [10] : 


•  The  second  step  is  to  define  a  null  hypothesis ,  which  is  a  model  of  the  system 
based  on  the  assumption  that  the  apparent  effect  is  not  real.  In  our  case,  the  null 
hypothesis  is  that  there  is  no  difference  between  the  two  periods. 

•  The  third  step  is  to  compute  a  p -value,  which  is  the  probability  of  seeing  the 
apparent  effect  if  the  null  hypothesis  is  true.  In  our  case,  we  would  compute  the 
difference  in  means,  then  compute  the  probability  of  seeing  a  difference  as  big,  or 
bigger,  under  the  null  hypothesis. 

•  The  last  step  is  to  interpret  the  result.  If  the  p -value  is  low,  the  effect  is  said  to  be 
statistically  significant ,  which  means  that  it  is  unlikely  to  have  occurred  by  chance. 
In  this  case  we  infer  that  the  effect  is  more  likely  to  appear  in  the  larger  population. 

In  our  case,  the  test  statistic  can  be  easily  computed: 

/  \ 

m=  len ( count s2 0 1 0 ) 
n=  len ( counts2 0 13 ) 

p  =  ( counts2013 . mean ( )  -  counts2010 . mean ( ) ) 

print  '  m  :  '  ,  m,  ' n : ' ,  n 

print  'mean  difference:  ' ,  p 

V _ / 


m:  365  n:  365 

mean  difference:  1.0986 

To  approximate  the  p -value  ,  we  can  follow  the  following  procedure: 

1.  Pool  the  distributions,  generate  samples  with  size  n  and  compute  the  difference 
in  the  mean. 

2.  Generate  samples  with  size  n  and  compute  the  difference  in  the  mean. 

3.  Count  how  many  differences  are  larger  than  the  observed  one. 


(  \ 

#  pooling  distributions 
x  =  counts  2  010 
y  =  counts  2  013 

pool  =  np .  concatenate  (  [x,  y]) 

np . random. shuffle (pool) 

ttsample  generation 
import  random 

N  =  10000  #  number  of  samples 

di f  f  =  range  ( N ) 
for  i  in  range  (N)  : 

pi  =  [random,  choice  (pool  )  for  _  in  xrange  (n)  ] 
p2  =  [random,  choice  (pool  )  for  _  in  xrange  (n)  ] 
di  f  f  [  i  ]  =  (  np  .  mean  (  pi  )  -  np  .  mean  (  p2  )  ) 

V _ / 
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Out [ 11 ] : p-value  ( Simulation) =  0.0485  (  4.85%)  Difference  =  1.098 
The  effect  is  likely 

Interpreting  P -Values 

A  p -value  is  the  probability  of  an  observed  (or  more  extreme)  result  arising  only 
from  chance. 

If  P  is  less  than  0.05,  there  are  two  possible  conclusions:  there  is  a  real  effect  or 
the  result  is  an  improbable  fluke.  Fisher's  method  offers  no  way  of  knowing  which  is 
the  case. 

We  must  not  confuse  the  odds  of  getting  a  result  (if  a  hypothesis  is  true)  with 
the  odds  of  favoring  the  hypothesis  if  you  observe  that  result.  If  P  is  less  than  0.05, 
we  cannot  say  that  this  means  that  it  is  95%  certain  that  the  observed  effect  is  real 
and  could  not  have  arisen  by  chance.  Given  an  observation  E  and  a  hypothesis  H , 
P(E\H )  and  P(H\E )  are  not  the  same! 

Another  common  error  equates  statistical  significance  to  practical  importance/ 
relevance.  When  working  with  large  datasets,  we  can  detect  statistical  significance 
for  small  effects  that  are  meaningless  in  practical  terms. 

We  have  defined  the  effect  as  a  difference  in  mean  as  large  or  larger  than  8, 
considering  the  sign.  A  test  like  this  is  called  one  sided. 

If  the  relevant  question  is  whether  accident  rates  are  different ,  then  it  makes  sense 
to  test  the  absolute  difference  in  means.  This  kind  of  test  is  called  two  sided  because 
it  counts  both  sides  of  the  distribution  of  differences. 

Direct  Approach 

The  formula  for  the  standard  error  of  the  absolute  difference  in  two  means  is  similar 
to  the  formula  for  other  standard  errors.  Recall  that  the  standard  error  of  a  single 
mean  can  be  approximated  by: 


<*i 

sfn I 


The  standard  error  of  the  difference  of  two  sample  means  can  be  constructed  from 
the  standard  errors  of  the  separate  sample  means: 


SEXl—  x2 


n  2 


This  would  allow  us  to  define  a  direct  test  with  the  95%  confidence  interval. 
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4.5  But  Is  the  Effect  E  Real? 

We  do  not  yet  have  an  answer  for  this  question!  We  have  defined  a  null  hypothesis 
Hq  (the  effect  is  not  real)  and  we  have  computed  the  probability  of  the  observed 
effect  under  the  null  hypothesis,  which  is  P(E\Hq),  where  E  is  an  effect  as  big  as 
or  bigger  than  the  apparent  effect  and  a  p -value  . 

We  have  stated  that  from  the  frequentist  point  of  view,  we  cannot  consider  Ha 
unless  P(E\Ho)  is  less  than  an  arbitrary  value.  But  the  real  answer  to  this  question 
must  be  based  on  comparing  P(Ho\E)  to  P(Ha\E),  not  on  P(E\Ho)\  One  possi¬ 
ble  solution  to  these  problems  is  to  use  Bayesian  reasoning ;  an  alternative  to  the 
frequentist  approach. 

No  matter  how  many  data  you  have,  you  will  still  depend  on  intuition  to  decide 
how  to  interpret,  explain,  and  use  that  data.  Data  cannot  speak  by  themselves.  Data 
scientists  are  interpreters,  offering  one  interpretation  of  what  the  useful  narrative 
story  derived  from  the  data  is,  if  there  is  one  at  all. 


4.6  Conclusions 

In  this  chapter  we  have  seen  how  we  can  approach  the  problem  of  making  probable 
propositions  regarding  population  parameters. 

We  have  learned  that  in  some  cases,  there  are  theoretical  results  that  allow  us  to 
compute  a  measure  of  the  variability  of  our  estimates.  We  have  called  this  approach 
the  “traditional  approach”.  Within  this  framework,  we  have  seen  that  the  sampling 
distribution  of  our  parameter  of  interest  is  the  most  important  concept  when  under¬ 
standing  the  real  meaning  of  propositions  concerning  parameters. 

We  have  also  learned  that  the  traditional  approach  is  not  the  only  alternative.  The 
“computationally  intensive  approach”,  based  on  the  bootstrap  method,  is  a  relatively 
new  approach  that,  based  on  intensive  computer  simulations,  is  capable  of  computing 
a  measure  of  the  variability  of  our  estimates  by  applying  a  resampling  method  to 
our  data  sample.  Bootstrapping  can  be  used  for  computing  variability  of  almost  any 
function  of  our  data,  with  its  only  downside  being  the  need  for  greater  computational 
resources. 

We  have  seen  that  propositions  about  parameters  can  be  classified  into  three 
classes:  propositions  about  point  estimates,  propositions  about  set  estimates,  and 
propositions  about  the  acceptance  or  the  rejection  of  a  hypothesis.  All  these  classes 
are  related;  but  today,  set  estimates  and  hypothesis  testing  are  the  most  preferred. 
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Finally,  we  have  shown  that  the  production  of  probable  propositions  is  not  error 
free,  even  in  the  presence  of  big  data.  For  these  reason,  data  scientists  cannot  forget 
that  after  any  inference  task,  they  must  take  decisions  regarding  the  final  interpretation 
of  the  data. 
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References 

1.  M.I.  Jordan.  Are  you  a  Bayesian  or  a  frequentist?  [Video  Lecture].  Published:  Nov.  2,  2009, 
Recorded:  September  2009.  Retrieved  from:  http://videolectures.net/mlss09uk_jordan_bfway/ 

2.  B.  Efron,  R.J.  Tibshirani,  An  introduction  to  the  bootstrap  (CRC  press,  1994) 


Supervised  Learning 


5.1  Introduction 

Machine  learning  involves  coding  programs  that  automatically  adjust  their  perfor¬ 
mance  in  accordance  with  their  exposure  to  information  in  data.  This  learning  is 
achieved  via  a  parameterized  model  with  tunable  parameters  that  are  automatically 
adjusted  according  to  different  performance  criteria.  Machine  learning  can  be  con¬ 
sidered  a  subfield  of  artificial  intelligence  (AI)  and  we  can  roughly  divide  the  field 
into  the  following  three  major  classes. 

1.  Supervised  learning:  Algorithms  which  learn  from  a  training  set  of  labeled 
examples  (exemplars)  to  generalize  to  the  set  of  all  possible  inputs.  Examples  of 
techniques  in  supervised  learning:  logistic  regression,  support  vector  machines, 
decision  trees,  random  forest,  etc. 

2.  Unsupervised  learning:  Algorithms  that  learn  from  a  training  set  of  unlabeled 
examples.  Used  to  explore  data  according  to  some  statistical,  geometric  or  sim¬ 
ilarity  criterion.  Examples  of  unsupervised  learning  include  k-means  clustering 
and  kernel  density  estimation.  We  will  see  more  on  this  kind  of  techniques  in 
Chap.  7. 

3.  Reinforcement  learning:  Algorithms  that  learn  via  reinforcement  from  criticism 
that  provides  information  on  the  quality  of  a  solution,  but  not  on  how  to  improve 
it.  Improved  solutions  are  achieved  by  iteratively  exploring  the  solution  space. 

This  chapter  focuses  on  a  particular  class  of  supervised  machine  learning:  clas¬ 
sification .  As  a  data  scientist,  the  first  step  you  apply  given  a  certain  problem  is  to 
identify  the  question  to  be  answered.  According  to  the  type  of  answer  we  are  seeking, 
we  are  directly  aiming  for  a  certain  set  of  techniques. 
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•  If  our  question  is  answered  by  YES/NO,  we  are  facing  a  classification  problem. 

Classifiers  are  also  the  tools  to  use  if  our  question  admits  only  a  discrete  set  of 

answers,  i.e.,  we  want  to  select  from  a  finite  number  of  choices. 

-  Given  the  results  of  a  clinical  test,  e.g.,  does  this  patient  suffer  from  diabetes? 

-  Given  a  magnetic  resonance  image,  is  it  a  tumor  shown  in  the  image? 

-  Given  the  past  activity  associated  with  a  credit  card,  is  the  current  operation 
fraudulent? 

•  If  our  question  is  a  prediction  of  a  real- valued  quantity,  we  are  faced  with  a  regres¬ 
sion  problem.  We  will  go  into  details  of  regression  in  Chap.  6. 

-  Given  the  description  of  an  apartment,  what  is  the  expected  market  value  of  the 
flat?  What  will  the  value  be  if  the  apartment  has  an  elevator? 

-  Given  the  past  records  of  user  activity  on  Apps,  how  long  will  a  certain  client 
be  connected  to  our  App? 

-  Given  my  skills  and  marks  in  computer  science  and  maths,  what  mark  will  I 
achieve  in  a  data  science  course? 

Observe  that  some  problems  can  be  solved  using  both  regression  and  classification. 
As  we  will  see  later,  many  classification  algorithms  are  thresholded  regressors.  There 
is  a  certain  skill  involved  in  designing  the  correct  question  and  this  dramatically 
affects  the  solution  we  obtain. 


5.2  The  Problem 

In  this  chapter  we  use  data  from  the  Lending  Club  to  develop  our  understanding  of 
machine  learning  concepts.  The  Lending  Club  is  a  peer-to-peer  lending  company. 
It  offers  loans  which  are  funded  by  other  people.  In  this  sense,  the  Lending  Club 
acts  as  a  hub  connecting  borrowers  with  investors.  The  client  applies  for  a  loan  of  a 
certain  amount,  and  the  company  assesses  the  risk  of  the  operation.  If  the  application 
is  accepted,  it  may  or  may  not  be  fully  covered.  We  will  focus  on  the  prediction 
of  whether  the  loan  will  be  fully  funded,  based  on  the  scoring  of  and  information 
related  to  the  application. 

We  will  use  the  partial  dataset  of  period  2007-2011.  Lraming  the  problem  a  little 
bit  more,  based  on  the  information  supplied  by  the  customer  asking  for  a  loan,  we 
want  to  predict  whether  it  will  be  granted  up  to  a  certain  threshold  thr .  The  attributes 
we  use  in  this  problem  are  related  to  some  of  the  details  of  the  loan  application,  such 
as  amount  of  the  loan  applied  for  the  borrower,  monthly  payment  to  be  made  by 
the  borrower  if  the  loan  is  accepted,  the  borrower’s  annual  income,  the  number  of 


1  https://www.lendingclub.com/info/download-data.action. 
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In  [  1 


incidences  of  delinquency  in  the  borrower’s  credit  file,  and  interest  rate  of  the  loan, 
among  others. 

In  this  case  we  would  like  to  predict  unsuccessful  accepted  loans.  A  loan  applica¬ 
tion  is  unsuccessful  if  the  funded  amount  (funded_amnt)  or  the  amount  funded 
by  investors  (funded_amnt_inv)  falls  far  short  of  the  requested  loan  amount 
(loan_amnt).  That  is, 


loan  —  funded 
loan 


>  0.95. 


5.3  First  Steps 


Note  that  in  this  problem  we  are  predicting  a  binary  value:  either  the  loan  is  fully 
funded  or  not.  Classification  is  the  natural  choice  of  machine  learning  tools  for 
prediction  with  discrete  known  outcomes.  According  to  the  cardinality  of  the  target 
set,  one  usually  distinguishes  between  binary  classifiers  when  the  target  output  only 
takes  two  values,  i.e.,  the  classifier  answers  questions  with  a  yes  or  a  no;  or  multiclass 
classifiers,  for  a  larger  number  of  classes.  This  issue  is  important  in  that  not  all 
methods  can  naturally  handle  the  multiclass  setting. 

In  a  formal  way,  classification  is  regarded  as  the  problem  of  finding  a  function 
h  (x)  :  Wl  ->  IK  that  maps  an  input  space  in  Wl  onto  a  discrete  set  of  k  target  outputs 
or  classes  K  =  {1,  In  this  setting,  the  features  are  arranged  as  a  vector  x  of 

d  real- valued  numbers. 


We  can  encode  both  target  states  in  a  numerical  variable,  e.g.,  a  successful  loan 
target  can  take  value  +1;  and  it  is  —1,  otherwise. 

Let  us  check  the  dataset,4 


import 

pickle 

\ 

o  f  name 

=  open ( ' . /files / ch05 / dataset_small . pkl ' , ' rb ' ) 

#  x  stores  input  data  and  y  target  values 

(  x  ,  y  ) 

=  p i c k 1 e . 1 oad ( o f name ) 

V 

V 

2 Several  well-known  techniques  such  as  support  vector  machines  or  adaptive  boosting  (adaboost) 
are  originally  defined  in  the  binary  case.  Any  binary  classifier  can  be  extended  to  the  multiclass  case 
in  two  different  ways.  We  may  either  change  the  formulation  of  the  learning/optimization  process. 
This  requires  the  derivation  of  a  new  learning  algorithm  capable  of  handling  the  new  modeling. 
Alternatively,  we  may  adopt  ensemble  techniques.  The  idea  behind  this  latter  approach  is  that  we 
may  divide  the  multiclass  problem  into  several  binary  problems;  solve  them;  and  then  aggregate  the 
results.  If  the  reader  is  interested  in  these  techniques,  it  is  a  good  idea  to  look  for:  one-versus-all, 
one-versus-one,  or  error  correcting  output  codes  methods. 

3 Many  problems  are  described  using  categorical  data.  In  these  cases  either  we  need  classifiers  that 
are  capable  of  coping  with  this  kind  of  data  or  we  need  to  change  the  representation  of  those  variables 
into  numerical  values. 

4The  notebook  companion  shows  the  preprocessing  steps,  from  reading  the  dataset,  cleaning  and 
imputing  data,  up  to  saving  a  subsampled  clean  version  of  the  original  dataset. 
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A  problem  in  Scikit-learn  is  modeled  as  follows: 

•  Input  data  is  structured  in  Numpy  arrays.  The  size  of  the  array  is  expected  to  be 
[n_samples,  n_features]: 

-  n_samples:  The  number  of  samples  ( n ).  Each  sample  is  an  item  to  process 
(e.g.,  classify).  A  sample  can  be  a  document,  a  picture,  an  audio  file,  a  video, 
an  astronomical  object,  a  row  in  a  database  or  CSV  file,  or  whatever  you  can 
describe  with  a  fixed  set  of  quantitative  traits. 

-  n_features:  The  number  of  features  id)  or  distinct  traits  that  can  be  used  to 
describe  each  item  in  a  quantitative  manner.  Features  are  generally  real- valued, 
but  may  be  Boolean,  discrete- valued  or  even  categorical. 


feature  matrix  :  X  = 


*11  X\2  •  •  •  X\d 
X21  X22  •  •  •  *2 d 
*31  *32  •  •  •  *3 d 


Xfl  1  Xn2  '  •  •  Xnd 


label  vector:  yT  =  [y\,  y2,  y$,  ■  ■  ■  yn] 


The  number  of  features  must  be  fixed  in  advance.  However,  it  can  be  very  great 


(e.g.,  millions  of  features). 

In  [2]  : 

f 

dims  =  x.  shape  [1] 

N  =  x .  shape  [  0 ] 

print  'dims:  '  +  str(dims)  +  samples:  '  +  str(N) 

\ 

S. _ 

_ y 

Out[2]:  dims:  15,  samples:  4140 

Considering  data  arranged  as  in  the  previous  matrices  we  refer  to: 

•  the  columns  as  features,  attributes,  dimensions,  regressors,  covariates,  predictors, 
or  independent  variables; 

•  the  rows  as  instances,  examples,  or  samples; 

•  the  target  as  the  label,  outcome,  response,  or  dependent  variable. 

All  objects  in  Scikit-learn  share  a  uniform  and  limited  API  consisting  of  three 
complementary  interfaces: 

•  an  estimator  interface  for  building  and  fitting  models  (f  it  ( ) ); 

•  a  predictor  interface  for  making  predictions  (predictO); 

•  a  transformer  interface  for  converting  data  (transformO). 


5.3  First  Steps 
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In  [3]  : 


Out  [  3  ]  : 


In  [4]  : 


Out  [  4  ]  : 


In  [  5  ]  : 


Let  us  apply  a  classifier  using  Python’s  Scikit-learn  libraries, 

f  \ 

from  sklearn  import  neighbors 
from  sklearn  import  datasets 

#  Create  an  instance  of  K-nearest  neighbor  classifier 

knn  =  neighbors . KNeighborsClassif ier (n_neighbors  =  11) 

#  Train  the  classifier 
knn  .  f  i  t  (  x  ,  y  ) 

#  Compute  the  prediction  according  to  the  model 
yhat  =  knn. predict (x) 

#  Check  the  result  on  the  last  example 

print  'Predicted  value:  '  +  str ( yha  t  [ - 1 ]  )  , 

real  target:  '  +  str  (y [-1]  ) 

V _ / 


Predicted  value:  -1.0  ,  real  target:  -1.0 

The  basic  measure  of  performance  of  a  classifier  is  its  accuracy.  This  is  defined  as 
the  number  of  correctly  predicted  examples  divided  by  the  total  amount  of  examples. 
Accuracy  is  related  to  the  error  as  follows:  acc  =  1  —  err. 


Number  of  correct  predictions 
acc  =  - 


n 

Each  estimator  has  a  score  ( )  method  that  invokes  the  default  scoring  metric. 
In  the  case  of  k-nearest  neighbors,  this  is  the  classification  accuracy. 


/  \ 

knn . score ( x , y ) 

V _ / 


0.83164251207729467 

It  looks  like  a  really  good  result.  But  how  good  is  it?  Let  us  first  understand  a  little 
bit  more  about  the  problem  by  checking  the  distribution  of  the  labels. 

Let  us  load  the  dataset  and  check  the  distribution  of  labels: 

/  \ 

pit  .pie  (  np  .  c_  [  np  .  sum  (  np  .  where  (y  ==  1,  1,  0)), 

np  .  sum  (  np  .  where  (y  ==  -1,  1,  0  )  )  ]  [  0  ]  , 

labels  =  ['Not  fully  funded', 'Full  amount'], 
colors  =  [  ' r '  ,  'g'],  shadow  =  False, 

au  t  opc  t  =  '  % . 2  f '  ) 

pit  .  gcf  ()  .  set_size_inch.es  (  (7  ,  7)  ) 

V _ / 


with  the  result  observed  in  big.  5.1. 

Note  that  there  are  far  more  positive  labels  than  negative  ones.  In  this  case,  the 
dataset  is  referred  to  as  unbalanced.  This  has  important  consequences  for  a  classifier 
as  we  will  see  later  on.  In  particular,  a  very  simple  rule  such  as  always  predict  the 


5  The  term  unbalanced  describes  the  condition  of  data  where  the  ratio  between  positives  and  negatives 
is  a  small  value.  In  these  scenarios,  always  predicting  the  majority  class  usually  yields  accurate 
performance,  though  it  is  not  very  informative.  This  kind  of  problems  is  very  common  when  we 
want  to  model  unusual  events  such  as  rare  diseases,  the  occurrence  of  a  failure  in  machinery, 
fraudulent  credit  card  operations,  etc.  In  these  scenarios,  gathering  data  from  usual  events  is  very 
easy  but  collecting  data  from  unusual  events  is  difficult  and  results  in  a  comparatively  small  dataset. 
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Fig.  5.1  Pie  chart  showing 
the  distribution  of  labels  in 
the  dataset 


majority  class,  will  give  us  good  performance.  In  our  problem,  always  predicting 
that  the  loan  will  be  fully  funded  correctly  predicts  81.57%  of  the  samples.  Observe 
that  this  value  is  very  close  to  that  obtained  using  the  classifier. 

Although  accuracy  is  the  most  normal  metric  for  evaluating  classifiers,  there  are 
cases  when  the  business  value  of  correctly  predicting  elements  from  one  class  is 
different  from  the  value  for  the  prediction  of  elements  of  another  class.  In  those 
cases,  accuracy  is  not  a  good  performance  metric  and  more  detailed  analysis  is 
needed.  The  confusion  matrix  enables  us  to  define  different  metrics  considering  such 
scenarios.  The  confusion  matrix  considers  the  concepts  of  the  classifier  outcome  and 
the  actual  ground  truth  or  gold  standard.  In  a  binary  problem,  there  are  four  possible 
cases: 

•  True  positives  (TP):  When  the  classifier  predicts  a  sample  as  positive  and  it  really 
is  positive. 

•  False  positives  ( FP ):  When  the  classifier  predicts  a  sample  as  positive  but  in  fact 
it  is  negative. 

•  True  negatives  (TN):  When  the  classifier  predicts  a  sample  as  negative  and  it  really 
is  negative. 

•  False  negatives  ( FN ):  When  the  classifier  predicts  a  sample  as  negative  but  in  fact 
it  is  positive. 

We  can  summarize  this  information  in  a  matrix,  namely  the  confusion  matrix,  as 
follows: 


5.3  First  Steps 
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Gold  Standard 


Positive 

Negative 

Positive 

TP 

r  fp  i- 

->  Precision 

Prediction  Negative 

FN 

t  TN  r 

->  Negative  Predictive  Value 

4-  4 

Sensitivity  Specificity 
(Recall) 


The  combination  of  these  elements  allows  us  to  define  several  performance  metrics: 


Accuracy: 


accuracy  = 


TP  +  TN 


TP  +  TN  +  FP  +  FN 
Column- wise  we  find  these  two  partial  performance  metrics: 


Sensitivity  or  Recall: 

sensitivity  = 

Specificity: 

specificity  = 


TP 


TP 


Real  Positives 


TN 


TP  +  FN 
TN 


Real  Negatives  TN  +  FP 


Row- wise  we  find  these  two  partial  performance  metrics: 


TP 

~~  TP  +  FP 
TN 

TN  +  FN 

These  partial  performance  metrics  allow  us  to  answer  questions  concerning  how 
often  a  classifier  predicts  a  particular  class,  e.g.,  what  is  the  rate  of  predictions  for 
not  fully  funded  loans  that  have  actually  not  been  fully  funded?  This  question  is 
answered  by  recall.  In  contrast,  we  could  ask:  Of  all  the  fully  funded  loans  predicted 
by  the  classifier,  how  many  have  been  fully  funded?  This  is  answered  by  the  precision 
metric. 

Let  us  compute  these  metrics  for  our  problem. 


-  Precision  or  Positive  Predictive  Value: 

TP 

precision  =  - 

Predicted  Positives 

-  Negative  predictive  value: 

TN 

NPV  = - = 

Predicted  Negative 
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In  [  6  ]  : 


Out  [  6  ]  : 


In  [7]  : 


Out  [  7  ]  : 


In  [8]  : 


Out  [  8  ]  : 


f  \ 


yha  t 

— 

knn  . 

pr  e 

diet 

(x) 

TP  = 

n  o 

.  sum 

(  np 

.  log 

i  c  a  1_ 

.and 

(  yha t 

—  —  _ 

1 

/  y  =z 

=  -i)  ) 

TN  = 

np 

.  sum 

(  np 

.  log 

ical. 

.and 

(  yha  t 

=  =  1 

/ 

y  =  = 

l)  ) 

FP  = 

np 

.  sum 

(  np 

.  log 

ical. 

.and 

(  yha  t 

—  —  _ 

1 

/  y  =  = 

=  i)  ) 

FN  = 

np 

.  sum 

(  np 

.  log 

ical. 

.and 

(  yha t 

=  =  1 

/ 

y  =  = 

-l)  ) 

print 

/ 

TP  : 

'  + 

str  ( 

TP)  , 

/ 

/ 

FP  :  '  + 

str 

( 

FP  ) 

print 

t 

FN  : 

'  + 

str  ( 

FN)  , 

/ 

/ 

TN  :  '  + 

str 

( 

TN  ) 

\ _ 7 


TP:  3370  ,  FP:  690 
FN:  7  ,  TN:  73 

Scikit-learn  provides  us  with  the  confusion  matrix, 


/ 

\ 

from  sklearn  import  metrics 

metrics . c on f u s i on_ma t r ix (yhat , 

Y) 

#  sklearn  uses  a  transposed  convention  for  the  confusion 

v 

#  matrix  thus  I  change  targets 

and  predictions 

_ / 

3370,  690 
7,  73 


Let  us  check  the  following  example.  Let  us  select  a  nearest  neighbor  classifier 
with  the  number  of  neighbors  equal  to  one  instead  of  eleven,  as  we  did  before,  and 
check  the  training  error. 


classification  accuracy:  1.0  confusion  matrix: 

3377  0 
0  763 

The  performance  measure  is  perfect!  100%  accuracy  and  a  diagonal  confusion 
matrix!  This  looks  good.  However,  up  to  this  point  we  have  checked  the  classifier 
performance  on  the  same  data  it  has  been  trained  with.  During  exploitation,  in  real 
applications,  we  will  use  the  classifier  on  data  not  previously  seen.  Let  us  simulate 
this  effect  by  splitting  the  data  into  two  sets:  one  will  be  used  for  learning  ( training 
set)  and  the  other  for  testing  the  accuracy  ( test  set). 


5.3  First  Steps 
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In  [  9  ]  : 


#  Simulate  a  real  case:  Randomize  and  split  data  into 

#  two  subsets  PRC*100\%  for  training  and  the  rest 

#  ( 1  -  PRC )  *  1 0 0 \ %  for  testing 

perm  =  np . random . permu tat i on ( y . s i z e ) 

PRC  =  0.7 

sp 1 i t_po i n  t  =  in t  ( np .ceil  (y.  shape  [0]  *  PRC  )  ) 


X_t  r a i n 
y_t  r a i n 

X_test 

y_test 


x  [perm  [  :  spli t_po i n t ]  .ravel  ()  ,  :] 
y [perm  [  :  spli t_po i n t ]  .ravel  (  )  ] 

x [perm [ spli t_p  oint  :  ]  .  ravel  (  )  ,:] 
y [perm [ spli t_p  oint:].  ravel  ()] 


If  we  check  the  shapes  of  the  training  and  test  sets  we  obtain, 


Out [9] 


In  [10] 


Training  shape:  (2898,  15),  training  targets  shape:  (2898,) 
Testing  shape:  (1242,  15),  testing  targets  shape:  (1242,) 

With  this  new  partition,  let  us  train  the  model 


#  Train 

a  classifier  on  training  data 

\ 

knn  = 

neighbors . KNeighborsClassif ier (n_neighbors  =  1) 

knn . fit (X_train ,  y_train) 

yha  t  = 

knn. predict (X_train) 

print 

" \ n  TRAINING  STATS:" 

print 

"classification  accuracy: "  + 

str (metrics . ac cur acy_s core ( yhat ,  y_train ) ) 

print 

"confusion  matrix:  \n"  + 

str  (metrics  .  confusi on_ma  t  r i x  ( y_t r a i n  ,  yhat )  ) 

V 

y 

Out [10] : TRAINING  STATS: 

classification  accuracy:  1.0 

confusion  matrix: 

2355  0 

0  543 


As  expected  from  the  former  experiment,  we  achieve  a  perfect  score.  Now  let  us 
see  what  happens  in  the  simulation  with  previously  unseen  data. 


N 

#  Check 

on  the  test  set 

yhat  = 

knn . predict ( X_tes t ) 

print 

" TESTING  STATS  :  " 

print 

" classification  accuracy :  "  , 
metrics . a c cur a cy_s c o r e (yhat , 

y_t  e s  t ) 

print 

"confusion  matrix:  \n"  + 

str  (metrics  .  confusi on_ma  t  r i x 

( yhat  ,  y_t  e  s  t )  ) 

V 

_ y 

Out [11]  :  TESTING  STATS: 

classification  accuracy:  0.754428341385 

confusion  matrix: 

865  148 

157  72 
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Observe  that  each  time  we  run  the  process  of  randomly  splitting  the  dataset  and 
train  a  classifier  we  obtain  a  different  performance.  A  good  simulation  for  approxi¬ 
mating  the  test  error  is  to  run  this  process  many  times  and  average  the  performances. 
Let  us  do  this! 


Out [ 12 ]:  Mean  expected  error:  0.754669887279 

As  we  can  see,  the  resulting  error  is  below  81%,  which  was  the  result  of  the  most 
naive  decision  process.  What  is  wrong  with  this  result? 

Let  us  introduce  the  nomenclature  for  the  quantities  we  have  just  computed  and 
define  the  following  terms. 


•  In-sample  error  E-m\  The  in-sample  error  or  training  error  is  the  error  measured 
over  all  the  observed  data  samples  in  the  training  set,  i.e., 

1  N 

Ein  =  —  L/fe,  yri 
i= 1 

•  Out-of-sample  error  Eout:  The  out-of-sample  error  or  generalization  error  mea¬ 
sures  the  expected  error  on  unseen  data.  We  can  approximate/simulate  this  quantity 
by  holding  back  some  training  data  for  testing  purposes. 

£out  —  y  (&(x  j  }0) 

Note  that  the  definition  of  the  instantaneous  error  e(x;,  yt)  is  still  missing.  For 
example,  in  classification  we  could  use  the  indicator  function  to  account  for  a  cor¬ 
rectly  classified  sample  as  follows: 


e(*i,  yt )  =  I[h(xi)  =  yt ] 


1,  if  h(xi)  =  yt 
0  otherwise. 


6sklearn  allows  us  to  easily  automate  the  train/test  splitting  using  the  function 
train_test_split (...). 


5.3  First  Steps 
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Fig.  5.2  Comparison  of  the  methods  using  the  accuracy  metric 
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Observe  that: 


Using  the  expected  error  on  the  test  set,  we  can  select  the  best  classifier  for 
our  application.  This  is  called  model  selection.  In  this  example  we  cover  the  most 
simplistic  setting.  Suppose  we  have  a  set  of  different  classifiers  and  want  to  select 
the  “best”  one.  We  may  use  the  one  that  yields  the  lowest  error  rate. 


In 


[13]  : 


from  sklearn  import  tree 
from  sklearn  import  svm 
PRC  =  0.1 

ac c_r  =  np .  zeros  (  (10  ,  4)) 

for  i  in  xrange  (10)  : 

X_train ,  X_test,  y_train ,  y_test  = 

t r a i n_t e s t_sp 1 i t  (  x  ,  y,  test_size  =  PRC) 
nnl  =  neighbors . KNeighborsClass i f ier ( n_neighbors  =  1) 

nn3  =  neighbors . KNeighborsClassif ier ( n_neighbors  =  3) 

sve  =  svm . SVC  (  ) 

dt  =  tree . DecisionTreeClassif ier  (  ) 


nnl . fit (X_train ,  y_train) 
nn3 . fit (X_train ,  y_train) 
sve  .  fit  (X_train  ,  y_train) 
dt  .  f i t  ( X_t rain  ,  y_train) 


yhat_nnl  =  nnl . predict (X_test ) 
yhat_nn3  =  nn3 . predict (X_test ) 
yhat_svc  =  sve . predict (X_test ) 
yhat_dt  =  dt . predict (X_test ) 


acc_r  [ i ]  [  0 ] 
acc_r  [  i ]  [  1 ] 
acc_r  [  i  ]  [2] 
acc_r  [  i  ]  [  3  ] 


metrics  .  a c c u r a cy_s c o r e  (yhat_nnl 
metrics  .  accuracy_score  (yhat_nn3 
metrics  .  accuracy_score  (yhat_svc 
metrics  .  a c c u r a cy_s c o r e  (yhat_dt  , 


y_t es  t ) 
y_t e s  t ) 
y_t e s  t ) 
y_test ) 


Figure  5.2  shows  the  results  of  applying  the  code. 
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This  process  is  one  particular  form  of  a  general  model  selection  technique  called 
cross-validation.  There  are  other  kinds  of  cross-validation,  such  as  leave-one-out  or 
K-fold  cross-validation. 

•  In  leave-one-out,  given  N  samples,  the  model  is  trained  with  TV  —  1  samples  and 
tested  with  the  remaining  one.  This  is  repeated  N  times,  once  per  training  sample 
and  the  result  is  averaged. 

•  In  K-fold  cross-validation,  the  training  set  is  divided  into  K  nonoverlapping  splits. 
K-l  splits  are  used  for  training  and  the  remaining  one  used  to  assess  the  mean. 
This  process  is  repeated  K  times  leaving  one  split  out  each  time.  The  results  are 
then  averaged. 


5.4  What  Is  Learning? 

Let  us  recall  the  two  basic  values  defined  in  the  last  section.  We  talk  of  training  error 
or  in-sample  error,  Em ,  which  refers  to  the  error  measured  over  all  the  observed  data 
samples  in  the  training  set.  We  also  talk  of  test  error  or  generalization  error ,  Eout, 
as  the  error  expected  on  unseen  data. 

We  can  empirically  estimate  the  generalization  error  by  means  of  cross-validation 
techniques  and  observe  that: 


£out  ^  ^in- 

The  goal  of  learning  is  to  minimize  the  generalization  error;  but  how  can  we 
guarantee  this  minimization  using  only  training  data? 

From  the  above  inequality  it  is  easy  to  derive  a  couple  of  very  intuitive  ideas. 

•  Because  Eout  is  greater  than  or  equal  to  Em,  it  is  desirable  to  have 

£in  ->  0. 

•  Additionally,  we  also  want  the  training  error  behavior  to  track  the  generalization 
error  so  that  if  one  minimizes  the  in-sample  error  the  out-of-sample  error  follows, 
i.e., 

£out  ~  ^in- 

We  can  rewrite  the  second  condition  as 

^in  25:  ^out  ^  ^in  T  E2 , 

with  Q  —>  0. 

We  would  like  to  characterize  El  in  terms  of  our  problem  parameters,  i.e.,  the 
number  of  samples  (A),  dimensionality  of  the  problem  (d),  etc. 

Statistical  analysis  offers  an  interesting  characterization  of  this  quantity 


7  The  reader  should  note  that  there  are  several  bounds  in  machine  learning  to  characterize  the 
generalization  error.  Most  of  them  come  from  variations  of  Hoeff ding’s  inequality. 


5.4  What  Is  Learning? 
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Fig.  5.3  Toy  problem  data  12 

10 


—4 

■ 

-  -4  -2  0  2  4  6  8  10  12  14 


). 


where  C  is  a  measure  of  the  complexity  of  the  model  class  we  are  using.  Technically, 
we  may  also  refer  to  this  model  class  as  the  hypothesis  space. 


5.5  Learning  Curves 

Let  us  simulate  the  effect  of  the  number  of  examples  on  the  training  and  test  errors 
for  a  given  complexity.  This  curve  is  called  the  learning  curve.  We  will  focus  for  a 
moment  in  a  more  simple  case.  Consider  the  toy  problem  in  Fig.  5.3. 

Let  us  take  a  classifier  and  vary  the  number  of  examples  we  feed  it  for  training 
purposes,  then  check  the  behavior  of  the  training  and  test  accuracies  as  the  number 
of  examples  grows.  In  this  particular  case,  we  will  be  using  a  decision  tree  with  fixed 
maximum  depth. 

Observing  the  plot  in  Fig.  5.4,  we  can  see  that: 

•  As  the  number  of  training  samples  increases,  both  errors  tend  to  the  same  value. 

•  When  we  have  few  training  data,  the  training  error  is  very  small  but  the  test  error 
is  very  large. 

Now  check  the  learning  curve  when  the  degree  of  complexity  is  greater  in  Fig.  5.5. 
We  simulate  this  effect  by  increasing  the  maximum  depth  of  the  tree. 

And  if  we  put  both  curves  together,  we  have  the  results  shown  in  Fig.  5.6. 
Although  both  show  similar  behavior,  we  can  note  several  differences: 
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Fig.  5.4  Learning  curves  (training  and  test  errors)  for  a  model  with  a  high  degree  of  complexity 
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Fig.  5.7  Learning  curves  (training  and  test  errors)  for  a  fixed  number  of  data  samples,  as  the 
complexity  of  the  decision  tree  increases 


•  With  a  low  degree  of  complexity,  the  training  and  test  errors  converge  to  the  bias 
sooner/with  fewer  data. 

•  Moreover,  with  a  low  degree  of  complexity,  the  error  of  convergence  is  larger  than 
with  increased  complexity. 

The  value  both  errors  converge  towards  is  also  called  the  bias ;  and  the  differ¬ 
ence  between  this  value  and  the  test  error  is  called  the  variance.  The  bias/variance 
decomposition  of  the  learning  curve  is  an  alternative  approach  to  the  training  and 
generalization  view. 

Let  us  now  plot  the  learning  behavior  for  a  fixed  number  of  examples  with  respect 
to  the  complexity  of  the  model.  We  may  use  the  same  data  but  now  we  will  change 
the  maximum  depth  of  the  decision  tree,  which  governs  the  complexity  of  the  model. 

Observe  in  Fig.  5.7  that  as  the  complexity  increases  the  training  error  is  reduced; 
but  above  a  certain  level  of  complexity,  the  test  error  also  increases.  This  effect  is 
called  overfitting.  We  may  enact  several  cures  for  overfitting: 

•  Observe  that  models  are  usually  parameterized  by  some  hyperparameters.  Select¬ 
ing  the  complexity  is  usually  governed  by  some  such  parameters.  Thus,  we  are 
faced  with  a  model  selection  problem.  A  good  heuristic  for  selecting  the  model  is 
to  choose  the  value  of  the  hyperparameters  that  yields  the  smallest  estimated  test 
error.  Remember  that  this  can  be  done  using  cross-validation. 

•  We  may  also  change  the  formulation  of  the  objective  function  to  penalize  complex 
models.  This  is  called  regularization.  Regularization  accounts  for  estimating  the 
value  of  Q  in  our  out-of-sample  error  inequality.  In  other  words,  it  models  the 
complexity  of  the  technique.  This  usually  becomes  implicit  in  the  algorithm  but 
has  huge  consequences  in  real  applications.  The  most  common  regularization 
strategies  are  as  follows: 
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-  L2  weight  regularization:  Adding  an  L2  penalization  term  to  the  weights  of  a 
weight-controlled  model  implies  looking  for  solutions  with  small  weight  values. 
Intuitively,  adding  an  L2  penalization  term  can  be  seen  as  a  surrogate  for  the 
notion  of  smoothness.  In  this  sense,  a  low  complexity  model  means  a  very 
smooth  model. 

-  LI  weight  regularization:  Adding  an  LI  regularization  term  forces  sparsity  in 
the  weights  of  the  model.  In  this  sense,  a  low  complexity  model  means  a  model 
with  few  components  or  few  active  terms. 

These  terms  are  added  to  the  objective  function.  They  trade  off  with  the  error 
function  in  the  objective  and  are  governed  by  a  hyperparameter.  Thus,  we  still 
have  to  select  this  parameter  by  means  of  model  selection. 

•  We  can  use  “ensemble  techniques”.  A  third  cure  for  overfitting  is  to  use  ensemble 
techniques.  The  best  known  are  bagging  and  boosting. 


5.6  Training,  Validation  and  Test 

Going  back  to  our  problem,  we  have  to  select  a  model  and  control  its  complexity 
according  to  the  number  of  training  data.  In  order  to  do  this,  we  can  start  by  using 
a  model  selection  technique.  We  have  seen  model  selection  before  when  we  wanted 
to  compare  the  performance  of  different  classifiers.  In  that  case,  our  best  bet  was  to 
select  the  classifier  with  the  smallest  Zs0ut-  Analogous  to  model  selection,  we  may 
think  of  selecting  the  best  hyperparameters  as  choosing  the  classifier  with  parameters 
that  performs  the  best.  Thus,  we  may  select  a  set  of  hyperparameter  values  and  use 
cross-validation  to  select  the  best  configuration. 

The  process  of  selecting  the  best  hyperparameters  is  called  validation.  This  intro¬ 
duces  a  new  set  into  our  simulation  scheme;  we  now  need  to  divide  the  data  we  have 
into  three  sets:  training,  validation,  and  test  sets.  As  we  have  seen,  the  process  of 
assessing  the  performance  of  the  classifier  by  estimating  the  generalization  error  is 
called  testing.  And  the  process  of  selecting  a  model  using  the  estimation  of  the  gen¬ 
eralization  error  is  called  validation.  There  is  a  subtle  but  critical  difference  between 
the  two  and  we  have  to  be  aware  of  it  when  dealing  with  our  problem. 

•  Test  data  is  used  exclusively  for  assessing  performance  at  the  end  of  the  process 
and  will  never  be  used  in  the  learning  process. 

•  Validation  data  is  used  explicitly  to  select  the  parameters/models  with  the  best 
performance  according  to  an  estimation  of  the  generalization  error.  This  is  a  form 
of  learning. 

•  Training  data  are  used  to  learn  the  instance  of  the  model  from  a  model  class. 


8This  set  cannot  be  used  to  select  a  classifier,  model  or  hyperparameter;  nor  can  it  be  used  in  any 
decision  process. 
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In  practice,  we  are  just  given  training  data,  and  in  the  most  general  case  we 
explicitly  have  to  tune  some  hyperparameter.  Thus,  how  do  we  select  the  different 
splits? 

How  we  do  this  will  depend  on  the  questions  regarding  the  method  that  we  want 
to  answer: 

•  Let  us  say  that  our  customer  asks  us  to  deliver  a  classifier  for  a  given  problem.  If 
we  just  want  to  provide  the  best  model,  then  we  may  use  cross-validation  on  our 
training  dataset  and  select  the  model  with  the  best  performance.  In  this  scenario, 
when  we  return  the  trained  classifier  to  our  customer,  we  know  that  it  is  the  one 
that  achieves  the  best  performance.  But  if  the  customer  asks  about  the  expected 
performance,  we  cannot  say  anything. 

A  practical  issue:  once  we  have  selected  the  model,  we  use  the  complete  training 
set  to  train  the  final  model. 

•  If  we  want  to  know  about  the  performance  of  our  model,  we  have  to  use  unseen 
data.  Thus,  we  may  proceed  in  the  following  way: 

1.  Split  the  original  dataset  into  training  and  test  data.  For  example,  use  30%  of 
the  original  dataset  for  testing  purposes.  This  data  is  held  back  and  will  only  be 
used  to  assess  the  performance  of  the  method. 

2.  Use  the  remaining  training  data  to  select  the  hyperparameters  by  means  of  cross- 
validation. 

3.  Train  the  model  with  the  selected  parameter  and  assess  the  performance  using 
the  test  dataset. 

A  practical  issue:  Observe  that  by  splitting  the  data  into  three  sets,  the  classifier 
is  trained  with  a  smaller  fraction  of  the  data. 

•  If  we  want  to  make  a  good  comparison  of  classifiers  but  we  do  not  care  about 
the  best  parameters,  we  may  use  nested  cross-validation.  Nested  cross-validation 
runs  two  cross-validation  processes.  An  external  cross-validation  is  used  to  assess 
the  performance  of  the  classifier  and  in  each  loop  of  the  external  cross-validation 
another  cross-validation  is  run  with  the  remaining  training  set  to  select  the  best 
parameters. 

If  we  want  to  select  the  best  complexity  of  a  decision  tree,  we  can  use  tenfold  cross- 
validation  checking  for  different  complexity  parameters.  If  we  change  the  maximum 
depth  of  the  method,  we  obtain  the  results  in  Fig.  5.8. 
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Fig.  5.8  Box  plot  showing  accuracy  for  different  complexities  of  the  decision  tree 


In  [14] 


r 


#  Create  a  10-fold  cross-validation  set 

kf  =  c  r  o  s  s_va 1 idation . KFold (n  =  y .  shape  [  0 ]  , 

n_f o Ids  =  10, 

shuffle  =  True, 
random_state  =  0) 


#  Search  for  the  parameter  among  the  following: 
C  =  np . arange  (2  ,  20,) 


ac  c  =  np  .zeros  ((10,  18)) 

i  =  0 

for  train_index,  val_index  in  kf : 

X_train ,  X_val  =  X[train_index] ,  X[val_index] 
y_train ,  y_val  =  y[train_index] ,  y[val_index] 

j  =  0 

for  c  in  C  : 

dt  =  t ree . Dec i s i onTreeC las s i f i er ( 
mi n_s amp 1 e s_l ea f  =  1, 

max_depth  =  c) 
dt . f it (X_train ,  y_train) 
yhat  =  dt . predict (X_val ) 

acc  [  i  ]  [j]  =  metrics.  accuracy_score  ( yhat  ,  y_val ) 

j  =  j  +  1 
i  =  i  +  1 


y 


Checking  Fig.  5.8,  we  can  see  that  the  best  average  accuracy  is  obtained  by  the 
fifth  model,  a  maximum  depth  of  6.  Although  we  can  report  that  the  best  accuracy 
is  estimated  to  be  found  with  a  complexity  value  of  6,  we  cannot  say  anything  about 
the  value  it  will  achieve.  In  order  to  have  an  estimation  of  that  value,  we  need  to  run 
the  model  on  a  new  set  of  data  that  are  completely  unseen,  both  in  training  and  in 
model  selection  (the  model  selection  value  is  positively  biased).  Let  us  put  everything 
together.  We  will  be  considering  a  simple  train_test  split  for  testing  purposes  and 
then  run  cross-validation  for  model  selection. 
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In  [15] 


/ - \ 

#  Train_test  split 

X_train ,  X_test,  y_train,  y_test  =  c r o s s_va 1 i da t i on 
.  t r a i n_t e s t_sp 1 i t  (  X  ,  y,  test_size  =  0.20) 

#  Create  a  10-fold  cross-validation  set 

kf  =  c r o s s_va 1 i da t i on . KF o 1 d ( n  =  y_t rain . shape [ 0 ] , 

n_f o Ids  =  10, 

shu f  f 1 e  =  True  , 
random_state  =  0) 

#  Search  the  parameter  among  the  following 

C  =  np . arange  (2  ,  20,) 

ac  c  =  np  .zeros  ((10,  18)) 

i  =  0 

for  train_index,  val_index  in  kf : 

X_t ,  X_val  =  X_train[train_index],  X_train [ val_index] 
y_t ,  y_val  =  y_train [train_index] ,  y_train [val_index] 

j  =  0 

for  c  in  C  : 

dt  =  tree .DecisionTreeClassifierf 
mi n_s amp 1 e s_l ea f  =  1, 

max_depth  =  c) 
dt  .  f  i  t  (  X_t  ,  y_t  ) 
yhat  =  dt . predict (X_val ) 

acc  [  i  ]  [j]  =  metrics.  accuracy_score  ( yhat  ,  y_val ) 

j  =  j  +  1 
i  =  i  +  1 

print  'Mean  accuracy:  '  +  str ( np .mean (acc ,  axis  =  0)) 

print  'Selected  model  index:  '  + 

s  tr  (  np  .  argmax  (  np  .mean  (acc  ,  axis  =  0))) 

V _ / 


Out [ 15] : Mean  accuracy:  [0.8254832  0.83031158  0.83091854  0.83423816 
0.83363939  0.83303516  0.82759983  0.82337022  0.82034725 
0.81642795  0.80947567  0.79951316  0.80162614  0.79226695 
0.79589324  0.785928  0.78049267  0.78320988] 

Selected  model  index:  3 

If  we  run  the  output  of  this  code,  we  observe  that  the  best  accuracy  is  provided 
by  the  fourth  model.  In  this  example  it  is  a  model  with  complexity  5.  The  selected 
model  achieves  a  success  rate  of  0.83423816  in  validation.  We  then  train  the  model 
with  the  complete  training  set  and  verify  its  test  accuracy. 


9This  reduction  in  the  complexity  of  the  best  model  should  not  surprise  us.  Remember  that  com¬ 
plexity  and  the  number  of  examples  are  intimately  related  for  the  learning  to  succeed.  By  using  a 
test  set  we  perform  model  selection  with  a  smaller  dataset  than  in  the  former  case. 
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In  [16] 


#  Train  the  model  with  the  complete  training  set  with  the 

selected  complexity 

dt  =  tree .DecisionTreeClassifier( 
mi n_s amp 1 e s_l ea f  =  1, 

max_depth  =  C [ np .argmax(np.mean(acc,  axis  =  0))]) 

dt . fit (X_train , y_t r a i n ) 

#  Test  the  model  with  the  test  set 
yhat  =  dt . predict ( X_test ) 

print  'Test  accuracy:  '  + 

str (metrics . ac cur acy_s core ( yhat ,  y_tes t ) ) 


Out[16]:Test  accuracy:  0.826086956522 


In  [17] 


As  expected,  the  value  is  slightly  reduced;  it  achieves  0.82608.  Finally,  the  model 
is  trained  with  the  complete  dataset.  This  will  be  the  model  used  in  exploitation  and 

we  expect  to  at  least  achieve  an  accuracy  rate  of  0.82608. 

/ - \ 

#  Train  the  final  model 

dt  =  tree .DecisionTreeClassifier (min_samples_leaf  =  1, 

max_dep th  =  C [ np .argmax(np.mean(acc,  axis  =  0))]) 

dt  .  f  i  t  (  X  ,  y  ) 

\ _ 2 


5.7  Two  Learning  Models 

Let  us  return  to  our  problem  and  check  the  performance  of  different  models.  There 
are  many  learning  models  in  the  machine  learning  literature.  However,  in  this  short 
introduction  we  focus  on  two  of  the  most  important  and  pragmatically  effective 
approaches  :  support  vector  machines  (SVM)  and  random  forests  (RF). 


5.7.1  Generalities  Concerning  Learning  Models 

Before  going  into  some  of  the  details  of  the  models  selected,  let  us  check  the  com¬ 
ponents  of  any  learning  algorithm.  In  order  to  be  able  to  learn,  an  algorithm  has  to 
define  at  least  three  components: 

•  The  model  class/hypothesis  space  defines  the  family  of  mathematical  models  that 
will  be  used.  The  target  decision  boundary  will  be  approximated  from  one  element 
of  this  space.  For  example,  we  can  consider  the  class  of  linear  models.  In  this  case 
our  decision  boundary  will  be  a  line  if  the  problem  is  defined  in  R2  and  the  model 
class  is  the  space  of  all  possible  lines  in  R2. 


10These  techniques  have  been  shown  to  be  two  of  the  most  powerful  families  for  classification  [1]. 
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Model  classes  define  the  geometric  properties  of  the  decision  function.  There  are 
different  taxonomies  but  the  best  known  are  the  families  of  linear  and  nonlinear 
models.  These  families  usually  depend  on  some  parameters;  and  the  solution  to  a 
learning  problem  is  the  selection  of  a  particular  set  of  parameters,  i.e.,  the  selection 
of  an  instance  of  a  model  from  the  model  class  space.  The  model  class  space  is 
also  called  the  hypothesis  space. 

The  selection  of  the  best  model  will  depend  on  our  problem  and  what  we  want 
to  obtain  from  the  problem.  The  primary  goal  in  learning  is  usually  to  achieve 
the  minimum  error/maximum  performance;  but  according  to  what  else  we  want 
from  the  algorithm,  we  can  come  up  with  different  algorithms.  Other  common 
desirable  properties  are  interpretability,  behavior  when  faced  with  missing  data, 
fast  training,  etc. 

•  The  problem  model  formalizes  and  encodes  the  desired  properties  of  the  solution. 
In  many  cases,  this  formalization  takes  the  form  of  an  optimization  problem.  In  its 
most  basic  instantiation,  the  problem  model  can  be  the  minimization  of  an  error 
function.  The  error  function  measures  the  difference  between  our  model  and  the 
target.  Informally  speaking,  in  a  classification  problem  it  measures  how  “irritated” 
we  are  when  our  model  misses  the  right  label  for  a  training  sample.  For  example, 
in  classification,  the  ideal  error  function  is  the  0-1  loss.  This  function  takes  value 
1  when  we  incorrectly  classify  a  training  sample  and  zero  otherwise.  In  this  case, 
we  can  interpret  it  by  saying  that  we  are  only  irritated  by  “one  unit  of  irritation” 
when  one  sample  is  misclassified. 

The  problem  model  can  also  be  used  to  impose  other  constraints  on  our  solution,11 
such  as  finding  a  smooth  approximation,  a  model  with  a  low  degree  of  small 
complexity,  a  sparse  solution,  etc. 

•  The  learning  algorithm  is  an  optimization/search  method  or  algorithm  that,  given 
a  model  class,  fits  it  to  the  training  data  according  to  the  error  function.  According 
to  the  nature  of  our  problem  there  are  many  different  algorithms.  In  general,  we 
are  talking  about  finding  the  minimum  error  approximation  or  maximum  probable 
model.  In  those  cases,  if  the  problem  is  convex/quasi-convex  we  will  typically  use 
first-  or  second-order  methods  (i.e.,  gradient  descent,  coordinate  descent,  Newton’s 
method,  interior  point  methods,  etc.).  Other  searching  techniques  such  as  genetic 
algorithms  or  Monte  Carlo  techniques  can  be  used  if  we  do  not  have  access  to  the 
derivatives  of  the  objective  function. 


5.7.2  Support  Vector  Machines 

S  VM  is  a  learning  technique  initially  designed  to  fit  a  linear  boundary  between  the 
samples  of  a  binary  problem,  ensuring  the  maximum  robustness  in  terms  of  tolerance 
to  isotropic  uncertainty.  This  effect  is  observed  in  Fig.  5.9.  Note  that  the  boundary 
displayed  has  the  largest  distance  to  the  closest  point  of  both  classes.  Any  other 


11  Remember  the  regularization  cure  for  overfitting. 
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Fig.  5.9  Support  vector 
machine  decision  boundary 
and  the  support  vectors 


separating  boundary  will  have  a  point  of  a  class  closer  to  it  than  this  one.  The  figure 
also  shows  the  closest  points  of  the  classes  to  the  boundary.  These  points  are  called 
support  vectors .  In  fact,  the  boundary  only  depends  on  those  points.  If  we  remove 
any  other  point  from  the  dataset,  the  boundary  remains  intact.  However,  in  general, 
if  any  of  these  special  points  is  removed  the  boundary  will  change. 


5.7.2. 1  A  Brief  Note  on  Deriving  Hard  Margin  Support  Vector  Machines 

In  order  to  understand  the  model,  we  have  to  be  able  to  approximately  derive  its  for¬ 
mulation.  For  this  purpose  it  is  important  to  understand  a  couple  of  things  about  basic 
geometry  of  a  hyperplane.  A  hyperplane  in  is  defined  as  an  affine  combination  of 
the  variables:  ir  =  aTx-\-b  =  0.  A  hyperplane  splits  the  space  into  two  half- spaces. 
The  evaluation  of  the  equation  of  the  hyperplane  on  any  element  belonging  to  one 
of  the  half-spaces  is  a  positive  value.  It  is  a  negative  value  for  all  the  elements  in  the 
other  half-space.  The  distance  of  a  point  v  e  to  the  hyperplane  n  is 

I  aT  x  +  b I 
d  (x ,  7 r)  =  - 

\\d  || 2 


Given  a  binary  classification  problem  with  training  data  V  =  {(v/,  yi )},  i  = 
l ...  N,  yi  e  {+1,  —  1},  consider  S  c  V  the  subset  of  all  data  points  belonging  to 
class  +1,  S  =  {xi\yi  =  +1},  and  1Z  =  {xi  \yi  =  —1}  its  complement. 
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Then  the  problem  of  finding  a  separating  hyperplane  consists  of  fulfilling  the 
following  constraints 


rp  rp 

a  Si  +b  >  0  and  a  r;  +  b  <  0  Vst  e  S,  rt  e  7 Z. 

This  is  a  feasibility  problem  and  it  is  usually  written  in  the  following  way  in 
optimization  standard  notation: 


minimize  1 

T 

subject  to  yi(a  Xi  +  b)  >  1,  Vx;  e  V 


The  solution  of  this  problem  is  not  unique.  Selecting  the  maximum  margin  hyper¬ 
plane  requires  us  to  add  a  new  constraint  to  our  problem.  Remember  from  the  geom¬ 
etry  of  the  hyperplane  that  the  distance  of  any  point  to  a  hyperplane  is  given  by: 


d(X,TT)  =  4^. 

V  ’  \W\\l 

Recall  also  that  we  want  positive  data  to  be  beyond  value  1  and  negative  data 
below  —1.  Thus,  what  is  the  distance  value  we  want  to  maximize? 

The  positive  point  closest  to  the  boundary  is  at  l/||a||2  and  the  negative  point 
closest  to  the  boundary  data  point  is  also  at  1/ \\a  || 2-  Thus,  data  points  from  different 
classes  are  at  least  l/\\a\\2  apart. 

Recall  that  our  goal  is  to  find  the  separating  hyperplane  with  maximum  margin, 
i.e.,  with  maximum  distance  between  elements  in  the  different  classes.  Thus,  we  can 
complete  the  former  formulation  with  our  last  requirement  as  follows: 


minimize  \\a\\2/2 

T 

subject  to  yi(a  Xi  +b)  >  1,  Vx*  e  V 

This  formulation  has  a  solution  as  long  as  the  problem  is  linearly  separable. 

In  order  to  deal  with  misclassifications,  we  are  going  to  introduce  a  new  set  of 
variables  £*,  that  represents  the  amount  of  violation  in  the  i- th  constraint.  If  the 
constraint  is  already  satisfied,  then  £/  =  0;  while  £*•  >  0  otherwise.  Because  £*  is 
related  to  the  errors,  we  would  like  to  keep  this  amount  as  close  to  zero  as  possible. 
This  makes  us  introduce  an  element  in  the  objective  trade-off  with  the  maximum 
margin. 


12Note  the  strict  inequalities  in  the  formulation.  Informally,  we  can  consider  the  smallest  satisfied 
constraint,  and  observe  that  the  rest  must  be  satisfied  with  a  larger  value.  Thus,  we  can  arbitrarily 
set  that  value  to  1  and  rewrite  the  problem  as 

7~T 

a  Si  +  b  >  1  and  a  r;  +  b  <  —  1 . 
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The  new  model  becomes: 

N 

\\a\\2/2  +  cJ2^ 

i  =  1 

Vi (a'  -Xj  +  b)  >  \  -  .  i  =  1 . . .  N 

Zi  >0 

where  C  is  the  trade-off  parameter  that  roughly  balances  the  rates  of  margin  and 
misclassification.  This  formulation  is  also  called  soft-margin  SVM. 

The  larger  the  C  value  is,  the  more  importance  one  gives  to  the  error,  i.e.,  the 
method  will  be  more  accurate  according  to  the  data  at  hand,  at  the  cost  of  being  more 
sensitive  to  variations  of  the  data. 

The  decision  boundary  of  most  problems  cannot  be  well  approximated  by  a  linear 
model.  In  SVM,  the  extension  to  the  nonlinear  case  is  handled  by  means  of  kernel 
theory.  In  a  pragmatic  way,  a  kernel  can  be  referred  to  as  any  function  that  captures 
the  similarity  between  any  two  samples  in  the  training  set.  The  kernel  has  to  be  a 
positive  semi-definite  function  as  follows: 


minimize 
subject  to 


•  Linear  kernel: 


k(xi,  xj )  =  v 


•  Polynomial  kernel : 

k(xi,xj)  =  (1  +xj  xjY 


•  Radial  Basis  Function  kernel : 


k(Xi,Xj)  =  e  2a2 


Note  that  selecting  a  polynomial  or  a  Radial  Basis  Function  kernel  means  that  we 
have  to  adjust  a  second  parameter  p  or  cr,  respectively.  As  a  practical  summary,  the 
SVM  method  will  depend  on  two  parameters  (C,  7)  that  have  to  be  chosen  carefully 
using  cross-validation  to  obtain  the  best  performance. 


5.7.3  Random  Forest 

Random  Forest  (RF)  is  the  other  technique  that  is  considered  in  this  work.  RF  is 
an  ensemble  technique.  Ensemble  techniques  rely  on  combining  different  classifiers 
using  some  aggregation  technique,  such  as  majority  voting.  As  pointed  out  earlier, 
ensemble  techniques  usually  have  good  properties  for  combating  overfitting.  In  this 
case,  the  aggregation  of  classifiers  using  a  voting  technique  reduces  the  variance  of 
the  final  classifier.  This  increases  the  robustness  of  the  classifier  and  usually  achieves 
a  very  good  classification  performance.  A  critical  issue  in  the  ensemble  of  classifiers 
is  that  for  the  combination  to  be  successful,  the  errors  made  by  the  members  of  the 
ensemble  should  be  as  uncorrelated  as  possible.  This  is  sometimes  referred  to  in  the 
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literature  as  the  diversity  of  the  classifiers.  As  the  name  suggests,  the  base  classifiers 
in  RF  are  decision  trees. 


5.7.3. 1  A  Brief  Note  on  Decision  Trees 

A  decision  tree  is  one  of  the  most  simple  and  intuitive  techniques  in  machine  learning, 
based  on  the  divide  and  conquer  paradigm.  The  basic  idea  behind  decision  trees  is  to 
partition  the  space  into  patches  and  to  fit  a  model  to  a  patch.  There  are  two  questions 
to  answer  in  order  to  implement  this  solution: 

•  How  do  we  partition  the  space? 

•  What  model  shall  we  use  for  each  patch? 

Tackling  the  first  question  leads  to  different  strategies  for  creating  decision  tree. 
However,  most  techniques  share  the  axis-orthogonal  hyperplane  partition  policy, 
i.e.,  a  threshold  in  a  single  feature.  For  example,  in  our  problem  “Does  the  applicant 
have  a  home  mortgage?”.  This  is  the  key  that  allows  the  results  of  this  method  to  be 
interpreted.  In  decision  trees,  the  second  question  is  straightforward,  each  patch  is 
given  the  value  of  a  label,  e.g.,  the  majority  label,  and  all  data  falling  in  that  part  of 
the  space  will  be  predicted  as  such. 

The  RF  technique  creates  different  trees  over  the  same  training  dataset.  The  word 
“random”  in  RF  refers  to  the  fact  that  only  a  subset  of  features  is  available  to  each 
of  the  trees  in  its  building  process.  The  two  most  important  parameters  in  RF  are  the 
number  of  trees  in  the  ensemble  and  the  number  of  features  each  tree  is  allowed  to 
check. 


5.8  Ending  the  Learning  Process 

With  both  techniques  in  mind,  we  are  going  to  optimize  and  check  the  results  using 
nested  cross-validation.  Scikit-learn  allows  us  to  do  this  easily  using  several  model 
selection  techniques.  We  will  use  a  grid  search,  Gr  idSearchCV  (a  cross-validation 
using  an  exhaustive  search  over  all  combinations  of  parameters  provided). 
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In  [16] : 


Out [ 16 ] 


( - \ 

parameters  =  {'C':  [le4,  le5,  1 e  6 ]  , 

' gamma [ 1 e -  5 ,  le-4,  1 e -  3 ]  } 

N_f olds  =  5 

kf  =  c  r  o  s  s_va lidation . KFold (n  =  y .  shape  [  0 ]  , 

n_f olds  =  N_f olds , 
shuffle  =  True, 
random_state  =  0) 

acc  =  np .  zeros  (  ( N_f olds  ,  )  ) 
i  =  0 

#  We  will  build  the  predicted  y  from  the  partial  predictions 
on  the  test  of  each  of  the  folds 
yha t  =  y .  copy  (  ) 

for  train_index,  test_index  in  kf : 

X_train ,  X_test  =  X [ train_ index , : ] ,  X[test_index , : ] 
y_t r a in  ,  y_t e  s t  =  y[train_index],  y[test_index] 
scaler  =  S t anda r dS c a 1 e r ( ) 

X_train  =  scaler . f it_transform (X_train) 
elf  =  svm .  SVC  ( kernel  =  '  rbf  '  ) 

elf  =  gr i d_s ear ch . Gr i dS ear chC V ( c 1 f ,  parameters,  cv  =  3) 

c 1 f .  f i t  ( X_t r a in  ,  y_t r a i n .  r ave 1  (  )  ) 

X_test  =  scaler . transform (X_test ) 
yhat [ tes t_index ]  =  elf . predict ( X_test ) 

print  metrics . accuracy_score ( yhat ,  y) 
print  metrics . confusion_matrix (yhat ,  y) 

V _ / 


classification  accuracy:  0.856038647343 

confusion  matrix: 

3371  590 

6  173 

The  result  obtained  has  a  large  error  in  the  non-fully  funded  class  (negative).  This 
is  because  the  default  scoring  for  cross-validation  grid-search  is  mean  accuracy. 
Depending  on  our  business,  this  large  error  in  recall  for  this  class  may  be  unaccept¬ 
able.  There  are  different  strategies  for  diminishing  the  impact  of  this  effect.  On  the 
one  hand,  we  may  change  the  default  scoring  and  find  the  parameter  setting  that  cor¬ 
responds  to  the  maximum  average  recall.  On  the  other  hand,  we  could  mitigate  this 
effect  by  imposing  a  different  weight  on  an  error  on  the  critical  class.  For  example, 
we  could  look  for  the  best  parameterization  such  than  one  error  on  the  critical  class 
is  equivalent  to  one  thousand  errors  on  the  noncritical  class.  This  is  important  in 
business  scenarios  where  monetization  of  errors  can  be  derived. 


5.9  A  Toy  Business  Case 

Consider  that  clients  using  our  service  yield  a  profit  of  100  units  per  client  (we  will  use 
abstract  units  but  keep  in  mind  that  this  will  usually  be  accounted  in  euros/dollars). 
We  design  a  campaign  with  the  goal  of  attracting  investors  in  order  to  cover  all 
non-fully  funded  loans.  Let  us  assume  that  the  cost  of  the  campaign  is  a  units 
per  client.  With  this  policy  we  expect  to  keep  our  customers  satisfied  and  engaged 
with  our  service,  so  they  keep  using  it.  Analyzing  the  confusion  matrix  we  can 
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Fig.  5.10  Surfaces  for  two 
different  campaign  and 
attraction  factors.  The 
horizontal  plane  corresponds 
to  the  profit  if  no  campaign 
is  launched.  The  slanted 
plane  is  the  profit  for  a 
certain  confusion  matrix 
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give  precise  meaning  to  different  concepts  in  this  campaign.  The  real  positive  set 
(TP  +  FN)  consists  of  the  number  of  clients  that  are  fully  funded.  According  to 
our  assumption,  each  of  these  clients  generates  a  profit  of  100  units.  The  total  profit 
is  100  •  (TP  +  FN).  The  campaign  to  attract  investors  will  be  cast  considering  all 
the  clients  we  predict  are  not  fully  funded.  These  are  those  that  the  classifier  predict 
as  negative,  i.e.,  (FN  +  TN).  However,  the  campaign  will  only  have  an  effect  on 
the  investors/clients  that  are  actually  not  funded,  i.e.,  T N\  and  we  expect  to  attract  a 
certain  fraction  (3  of  them.  After  deploying  our  campaign,  a  simplified  model  of  the 
expected  profit  is  as  follows: 

100  •  (TP  +  FN)  -  a(TN  +  FN)  +  100 (3TN 

When  optimizing  the  classifier  for  accuracy,  we  do  not  consider  the  business  needs. 
In  this  case,  optimizing  an  S  VM  using  cross-validation  for  different  parameters  of  the 
C  and  7,  we  have  an  accuracy  of  85.60%  and  a  confusion  matrix  with  the  following 
values: 

/ 3371.  590. \ 

\  6.  173. y 

If  we  check  how  the  profit  changes  for  different  values  of  a  and  /3 ,  we  obtain  the  plot 
in  Fig.  5.10.  The  figure  shows  two  hyperplanes.  The  horizontal  plane  is  the  expected 
profit  if  the  campaign  is  not  launched,  i.e.,  100  •  (TP  +  FN).  The  other  hyperplane 
represents  the  profit  of  the  campaign  for  different  values  of  a  and  (3  using  a  particular 
classifier.  Remember  that  the  cost  of  the  campaign  is  given  by  a,  and  the  success  rate 
of  the  campaign  is  represented  by  (3.  For  the  campaign  to  be  successful  we  would 
like  to  select  values  for  both  parameters  so  that  the  profit  of  the  campaign  is  larger 
than  the  cost  of  launching  it.  Observe  in  the  figure  that  certain  costs  and  attraction 
rates  result  in  losses. 

We  may  launch  different  classifiers  with  different  configurations  and  toy  with  dif¬ 
ferent  weights  (2,4,8,  16)  for  elements  of  different  classes  in  order  to  bias  the  classi- 
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(a) 


Fig.  5.11  3D  surfaces  of  the  profit  obtained  for  different  classifiers  and  configurations  of  retention 
campaign  cost  and  retention  rate,  a  RF,  b  SVM  with  the  same  cost  per  class,  c  SVM  with  double 
cost  for  the  target  class,  d  SVM  with  a  cost  for  the  target  class  equal  to  4,  e  SVM  with  a  cost  for 
the  target  class  equal  to  8,  f  SVM  with  a  cost  for  the  target  class  equal  to  16 


fier  towards  obtaining  different  values  for  the  confusion  matrix. 13  The  weights  define 


13  It  is  worth  mentioning  that  another  useful  tool  for  visualizing  the  trade-off  between  true  positives 
and  false  positives  in  order  to  choose  the  operating  point  of  the  classifier  is  the  receiver-operating 
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Table  5.1  Different  configurations  of  classifiers  and  their  respective  profit  rates  and  accuracies 


Max  profit  rate  (%) 

Profit  rate  at  60%  (%) 

Accuracy  (%) 

Random  forest 

4.41 

2.41 

87.87 

SVM  {1  :  1} 

4.59 

2.54 

85.60 

SVM  {1  :  2} 

4.52 

2.50 

85.60 

SVM  {1  :  4} 

4.30 

2.28 

83.81 

SVM  {1  :  8} 

10.69 

3.57 

52.51 

SVM  {1  :  16} 

10.68 

2.88 

41.40 

how  much  a  misclassification  in  one  class  counts  with  respect  to  a  misclassification 
in  another.  Figure  5.1 1  shows  the  different  landscapes  for  different  configurations  of 
the  SVM  classifier  and  RF. 

In  order  to  frame  the  problem,  we  consider  a  very  successful  campaign  with  a 
60%  investor  attraction  rate.  We  can  ask  several  questions  in  this  scenario: 

•  What  is  the  maximum  amount  to  be  spent  on  the  campaign? 

•  How  much  will  I  gain? 

•  From  all  possible  configurations  of  the  classifier,  which  is  the  most  profitable? 

•  Is  it  the  one  with  the  best  accuracy? 

Checking  the  values  in  Fig.  5. 1 1 ,  we  find  the  results  collected  in  Table  5. 1 .  Observe 
that  the  most  profitable  campaign  with  60%  corresponds  to  a  classifier  that  considers 
the  cost  of  mistaking  a  sample  from  the  non-fully  funded  class  eight  times  larger 
than  the  one  from  the  other  class.  Observe  also  that  the  accuracy  in  that  case  is  much 
worse  than  in  other  configurations. 

The  take-home  idea  of  this  section  is  that  business  needs  are  often  not  aligned  with 
the  notion  of  accuracy.  In  such  scenarios,  the  confusion  matrix  values  have  specific 
meanings.  This  must  be  taken  into  account  when  tuning  the  classifier. 


5.10  Conclusion 

In  this  chapter  we  have  seen  the  basics  of  machine  learning  and  how  to  apply  learning 
theory  in  a  practical  case  using  Python.  The  example  in  this  chapter  is  a  basic  one 
in  which  we  can  safely  assume  the  data  are  independent  and  identically  distributed, 
and  that  they  can  be  readily  represented  in  vector  form.  However,  machine  learning 


(Footnote  13  continued) 

characteristic  (ROC)  curve.  This  curve  plots  the  true  positive  rate/sensitivity /recall  (TP/(TP+FN)) 
with  respect  to  the  false  positive  rate  (FP/(FP+TN)). 
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may  tackle  many  more  different  settings.  For  example,  we  may  have  different  target 
labels  for  a  single  example;  this  is  called  multilabel  learning.  Or,  data  can  come 
from  streams  or  be  time  dependent;  in  these  settings,  sequential  learning  or  sequence 
learning  can  be  the  methods  of  choice.  Moreover,  each  data  example  can  be  a  non¬ 
vector  or  have  a  variable  size,  such  as  a  graph,  a  tree,  or  a  string.  In  such  scenarios 
kernel  learning  or  structural  learning  may  be  used.  During  these  last  years  we  are  also 
seeing  the  revival  of  neural  networks  under  the  name  of  deep  learning  and  achieving 
impressive  results  in  different  domains  such  as  computer  vision  or  natural  language 
processing.  Nonetheless,  all  of  these  methods  will  behave  as  explained  in  this  chapter 
and  most  of  the  lessons  learned  here  can  be  readily  applied  to  these  techniques. 

Acknowledgements  This  chapter  was  co-written  by  Oriol  Pujol  and  Petia  Radeva. 
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Regression  Analysis 


6.1  Introduction 

In  this  chapter,  we  introduce  regression  analysis  and  some  of  its  applications  in  data 
science.  Regression  is  related  to  how  to  make  predictions  about  real-world  quantities 
such  as,  for  instance,  the  predictions  alluded  to  in  the  following  questions.  How  does 
sales  volume  change  with  changes  in  price?  How  is  sales  volume  affected  by  the 
weather?  How  does  the  title  of  a  book  affect  its  sales?  How  does  the  amount  of  a 
drug  absorbed  vary  with  the  patient’s  body  weight;  and  does  this  relationship  depend 
on  blood  pressure?  How  many  customers  can  I  expect  today?  At  what  time  should  I 
go  home  to  avoid  traffic  jams?  What  is  the  chance  of  rain  on  the  next  two  Mondays; 
and  what  is  the  expected  temperature? 

All  these  questions  have  a  common  structure:  they  ask  for  a  response  that  can 
be  expressed  as  a  combination  of  one  or  more  (independent)  variables  (also  called 
covariates  or  predictors).  The  role  of  regression  is  to  build  a  model  to  predict  the 
response  from  the  variables.  This  process  involves  the  transition  from  data  to  model. 

More  specifically,  the  model  can  be  useful  in  different  tasks,  such  as  the  following: 
(1)  analyzing  the  behavior  of  data  (the  relation  between  the  response  and  the  vari¬ 
ables),  (2)  predicting  data  values  (whether  continuous  or  discrete),  and  (3)  finding 
important  variables  for  the  model. 

In  order  to  understand  how  a  regression  model  can  be  suitable  for  tackling  these 
tasks,  we  will  introduce  three  practical  cases  for  which  we  use  three  real  datasets  and 
solve  different  questions.  These  practical  cases  will  motivate  simple  linear  regression , 
multiple  linear  regression ,  and  logistic  regression ,  as  presented  in  the  following 
sections. 
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Fig.  6.1  Illustration  of  different  simple  linear  regression  models.  Blue  points  correspond  to  a  set 
of  random  points  sampled  from  a  univariate  normal  (Gaussian)  distribution.  Red,  green  and  yellow 
lines  are  three  different  simple  linear  regression  models 


6.2  Linear  Regression 

The  objective  of  performing  a  regression  is  to  build  a  model  to  express  the  relation 
between  the  response  y  e  W1  and  a  combination  of  one  or  more  (independent)  vari¬ 
ables  x*  e  M77.  [1]  The  model  allows  us  to  predict  the  response  y  from  the  variables. 
The  simplest  model  which  can  be  considered  is  a  linear  model ,  where  the  response 
y  depends  linearly  on  the  d  variables  x; : 

y  =  aixi  H - b  adxd.  (6.1) 

The  variables  <2/  are  termed  the  parameters  or  coefficients  of  the  model.  This 
equation  can  be  rewritten  in  a  more  compact  matrix  form:  y  =  Xw,  where 


^1^ 

^*11  ...X\ d\ 

/ a\  \ 

y  = 

yi 

,X  = 

x2\  •  •  •  x2 d 

,  w  = 

C12 

\yn) 

\  xn\  •  •  •  xnd  } 

\ad  ) 

Linear  regression  is  the  technique  for  creating  these  linear  models. 


6.2.1  Simple  Linear  Regression 

Simple  linear  regression  considers  n  samples  of  a  single  variable  x  e  M77  and 
describes  the  relationship  between  the  variable  and  the  response  with  the  model: 

y  =  <2o  +  <2ix,  (6.2) 

where  the  parameter  <20  is  called  the  intercept  or  the  constant  term. 

Given  a  set  of  samples  (x,  y),  such  as  the  set  illustrated  in  Fig.  6.1,  we  can  create 
a  linear  model  to  explain  the  data,  as  in  Eq.  (6.2).  But  how  do  we  know  which  is  the 
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best  model  (best  parameters)  for  this  particular  set  of  samples?  See  the  three  different 
models  (straight  lines  in  different  colors)  in  Fig.  6.1. 

Ordinary  least  squares  (OLS)  is  the  simplest  and  most  common  estimator  in  which 
the  parameters  (<Fs)  are  chosen  to  minimize  the  square  of  the  distance  between  the 
predicted  values  and  the  actual  values  with  respect  to  a$,a\ : 

n 

||ao  +  aix  —  y 1 12  =  ^(ao+ai*/  -  yjj1 2- 

j= 1 

We  are  concerned  here  with  the  y-axis  distance,  since  it  does  not  consider  the  error 
in  the  variables.  This  error  expression  is  often  called  the  sum  of  squared  errors  of 
prediction  (SSE).  The  SSE  function  is  quadratic  in  the  parameters,  w,  with  positive- 
definite  Hessian,  and  therefore  this  function  possesses  a  unique  global  minimum  at 
w  =  (do,  a\).  The  resulting  model  is  represented  as  follows:  y  =  do  +  a\x,  where 
the  hats  on  the  variables  represent  the  fact  that  they  are  estimated  from  the  data 
available. 

OLS  is  a  popular  approach  for  several  reasons.  It  makes  it  computationally  cheap  to 
calculate  the  coefficients.  It  is  also  easier  to  interpret  than  the  other  more  sophisticated 
models.  In  situations  where  the  goal  is  to  understand  a  simple  model  in  detail,  rather 
than  to  estimate  the  response  well,  it  can  provide  insight  into  what  the  model  captures. 
Finally,  in  situations  where  there  is  a  lot  of  noise,  as  in  many  real  scenarios,  it  may 
be  hard  to  find  the  true  functional  form,  so  a  constrained  model  can  perform  quite 
well  compared  to  a  complex  model  which  can  be  more  affected  by  noise. 

Practical  Case:  Sea  Ice  Data  and  Climate  Change 

In  this  practical  case,  we  pose  the  question:  Is  the  climate  really  changing?  More 
concretely,  we  want  to  show  the  effect  of  the  climate  change  by  determining  whether 
the  sea  ice  area  (or  extent)  has  decreased  over  the  years.  Sea  ice  area  refers  to  the 
total  area  covered  by  ice,  whereas  sea  ice  extent  is  the  area  of  ocean  with  at  least 
15%  sea  ice.  Reliable  measurement  of  sea  ice  edges  began  with  the  satellite  era  in 
the  late  1970s.  Before  then,  sea  ice  area  and  extent  were  monitored  less  precisely  by 
a  combination  of  ships,  buoys,  and  aircraft. 

We  will  use  the  sea  ice  data  from  the  National  Snow  &  Ice  Data  Center  which 
provides  measurements  of  the  area  and  extend  of  sea  ice  at  the  poles  over  the  last 
36  years.  The  center  has  given  access  to  the  archived  monthly  Sea  Ice  Index  images 
and  data  since  1979  [2].  The  archived  data  reside  at  an  FTP  location  (web-page 
instructions  can  be  followed  easily  to  access  and  download  the  files).  The  ASCII 
data  files  tabulate  sea  ice  extent  and  area  (in  millions  of  square  kilometers)  by  year 
for  a  given  month. 

In  order  to  check  whether  there  is  an  anomaly  in  the  evolution  of  sea  ice  extent 
over  recent  years,  we  want  to  build  a  simple  linear  regression  model  and  analyze  the 
fitting;  but  before  we  need  to  perform  several  processing  steps. 


1  https  ://nsidc.org/data/seaice_index/  archives.html. 

2ftp://sidads.colorado.edu/DATASETS/NOAA/G02135/. 
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In  [1]  : 


Out  [  1  ]  : 


In  [2]  : 


In  [3]  : 
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Fig.  6.2  Ice  extent  data  by  month 


First,  we  read  the  data,  previously  downloaded,  and  create  a  DataFrame 
( Pandas )  as  follows: 

/  \ 

ice  =  pd. read_csv ('files/ch06/SeaIce. txt ' , 

del im_whi tespace=True ) 
print  ' shape : ' ,  ice . shape 

v _ y 


shape:  (424,  6) 

For  data  cleaning,  we  check  the  values  of  all  the  fields  to  detect  any  potential  error. 
We  find  that  there  is  a  ‘—9999’  value  in  the  data_type  field  which  should  contain 
‘Goddard’  or  ‘NRTSI-G’  (the  type  of  the  input  dataset).  So  we  can  easily  clean  the 
data,  removing  these  instances. 

/  \ 

ice2  =  ice  [ ice . data_type  !=  '-99  99'] 

\ _ y 

Next,  we  visualize  the  data.  The  lmplot  ( )  function  from  the  Seaborn  toolbox 
is  intended  for  exploring  linear  relationships  of  different  forms  in  multidimensional 
datasets.  For  instance,  we  can  illustrate  the  relationship  between  the  month  of  the 

year  (variable)  and  the  extent  (response)  as  follows: 

/  \ 

import  Seaborn  as  sns 

sns  .  lmplot  (  " mo  ",  "extent"  ,  ice2  ) 

v _ y 

This  outputs  Fig.  6.2.  We  can  observe  a  monthly  fluctuation  of  the  sea  ice  extent, 
as  would  be  expected  for  the  different  seasons  of  the  year. 

We  should  normalize  the  data  before  performing  the  regression  analysis  to  avoid 
this  fluctuation  and  be  able  to  study  the  evolution  of  the  extent  over  the  years.  To 
capture  the  variation  for  a  given  interval  of  time  (month),  we  can  compute  the  mean 
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Fig.  6.3  Ice  extent  data  by  month  after  the  normalization 


for  the  i-th  interval  of  time  (using  the  period  from  1979  through  2014  for  the  mean 
extent)  /z*,  and  subtract  it  from  the  set  of  extent  values  for  that  month  {elj}.  This 
value  can  be  converted  to  a  relative  percentage  difference  by  dividing  it  by  the  total 
average  (1979-2014)  /z,  and  then  multiplying  by  100: 

e\  =  100  *  — - i  =  1, . . . ,  12. 

J  M 

We  implement  this  normalization  and  plot  the  relationship  again  as  follows: 


The  new  output  is  in  Fig.  6.3.  We  now  observe  a  comparable  range  of  values  for 
all  months. 

Next,  the  normalized  values  can  be  plotted  for  the  entire  time  series  to  analyze  the 
tendency.  We  compute  the  trend  as  a  simple  linear  regression.  We  use  the  Implot  ( ) 
function  for  visualizing  linear  relationships  between  the  year  (variable)  and  the  extent 
(response). 

C  \ 

sns  .  Implot  (  "year  "  ,  "extent",  ice2) 

\ _ y 

This  outputs  Fig.  6.4  showing  the  regression  model  fitting  the  extent  data.  This 
plot  has  two  main  components.  The  first  is  a  scatter  plot,  showing  the  observed  data 
points.  The  second  is  a  regression  line,  showing  the  estimated  linear  model  relating 
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Fig.  6.4  Regression  model  fitting  sea  ice  extent  data  for  all  months  by  year  using  Imp  lot 


the  two  variables.  The  regression  line  is  plotted  with  a  95%  confidence  band  to  give 
an  impression  of  the  uncertainty  in  the  model. 

In  this  figure,  we  can  observe  that  the  data  show  a  long-term  negative  trend  over 
years.  The  negative  trend  can  be  attributed  to  global  warming,  although  there  is  also 
a  considerable  amount  of  variation  from  year  to  year. 

Up  until  here,  we  have  qualitatively  shown  the  linear  regression  using  a  useful  visu¬ 
alization  tool.  We  can  also  analyze  the  linear  relationship  in  the  data  using  the  Scikit- 
learn  library,  which  allows  a  quantitative  evaluation.  As  was  explained  in  the  previous 
chapter,  Scikit-learn  provides  an  object-oriented  interface  centered  around  the  con¬ 
cept  of  an  estimator.  The  sklearn .  linear_model .  LinearRegression 
estimator  sets  the  state  of  the  estimator  based  on  the  training  data  using  the  function 
fit.  Moreover,  it  allows  the  user  to  specify  whether  to  fit  an  intercept  term  in  the 
object  construction.  This  is  done  by  setting  the  corresponding  constructor  arguments 
of  the  estimator  object  as  follows: 

f - \ 

from  ski earn . 1 inear_mode 1  import  LinearRegression 
est  =  LinearRegression (fit_intercept  =  True) 

V _ / 

During  the  fitting  process,  the  state  of  the  estimator  is  stored  in  instance 
attributes  that  have  a  trailing  underscore  ('_')•  For  example,  the  coefficients  of  a 
LinearRegression  estimator  are  stored  in  the  attribute  coef_.  We  fit  a  regres¬ 
sion  model  using  years  as  variables  (x)  and  the  extent  values  as  the  response  (y). 

/  \ 

x  =  ice2  [ [ ' year ' ] ] 
y  =  ice2  [  [  ' extent  '  ]  ] 
est . fit (x,  y) 

print  "Coefficients : " ,  est . coef_ 
print  " Intercept : " ,  est . intercept. 

\ _ / 


In  [7]  : 
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Out  [  7  ]  : 


In  [8]  : 


Out  [  8  ]  : 


In  [  9  ]  : 


Out  [  9  ]  : 


Coefficients:  [[-0.45275459]] 
Intercept:  [  903.71640207] 


Estimators  that  can  generate  predictions  provide  an  Estimator . predict 
method.  In  the  case  of  regression,  Estimator  .  predict  will  return  the  predicted 
regression  values.  We  can  evaluate  the  model  fitting  by  computing  the  mean  squared 
error  (MSE)  and  the  coefficient  of  determination  ( R 2)  of  the  model.  The  coefficient 
R 2  is  defined  as  (1  —  u/v),  with  u  =  ~  y)2  and  v  =  —  y)2,  where  y  is  the 

mean.  The  best  possible  score  for  R2  is  1.0,  lower  values  are  worse  (it  can  also  be 
negative).  These  measures  can  provide  a  quantitative  answer  to  the  question  we  are 
facing:  Is  there  a  negative  trend  in  the  evolution  of  sea  ice  extent  over  recent  years? 
We  can  perform  this  analysis  for  a  particular  month  or  for  all  months  together,  as 
done  in  the  following  lines: 
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MSE:  10.5391316398 
R2:  0.50678703821 
var :  31.98324 


The  negative  trend  seen  in  Fig.  6.4  is  validated  by  the  MSE  value  which  is  small, 
0.1%,  and  the  R2  value  which  is  acceptable,  given  the  variance  of  the  data,  0.3%. 

Given  the  model,  we  can  also  predict  the  extent  value  for  the  coming  years.  For 
instance,  the  predicted  extent  for  January  2025  can  be  computed  as  follows: 


x  =  [2025] 

y_hat  =  mode 1 . pr edi c t ( x ) 

m  =  1  #  January 

y_hat  =  ( y_ha t *  mon th_means  .mean  (  )  /100)  +  mon th_means  [  m  ] 

print  "Prediction  of  extent  for  January  2025 
(in  millions  of  square  km) : " ,  y_hat 


v 


y 


Prediction  of  extent  for  January  2025  (in  millions  of  square 
km) :  [12 . 93603933] . 


6.2.2  Multiple  Linear  Regression  and  Polynomial  Regression 

As  we  have  seen  in  the  previous  section,  with  simple  linear  regression  we  describe 
the  relationship  between  the  variable  and  the  response  with  a  straight  line.  In  the 
case  of  multiple  linear  regression ,  we  extend  this  idea  by  fitting  a  d-dimensional 
hyperplane  to  our  d  variables,  as  defined  in  Eq.  (6.1). 

Multiple  linear  regression  may  seem  a  very  simple  model,  but  even  when  the 
response  depends  on  the  variables  in  nonlinear  ways,  this  model  can  still  be  used  by 
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considering  nonlinear  transformations  0(-)  of  the  variables: 

y  =  a\(f) (x i )  H - h  ad(j)(xd) 

This  model  is  called  polynomial  regression  and  it  is  a  popular  nonlinear  regression 
technique  which  models  the  relationship  between  the  response  and  the  variables 
as  an  p- th  order  polynomial.  The  higher  the  order  of  the  polynomial,  the  more 
complex  the  functions  you  can  fit.  However,  using  higher-order  polynomial  can 
involve  computational  complexity  and  overfitting.  Overfitting  occurs  when  a  model 
fits  the  characteristics  of  the  training  data  and  loses  the  capacity  to  generalize  from 
the  seen  to  predict  the  unseen. 


6.2.3  Sparse  Model 

Often,  in  real  problems,  there  are  uninformative  variables  in  the  data  which  prevent 
proper  modeling  of  the  problem  and  thus,  the  building  of  a  correct  regression  model. 
In  such  cases,  a  feature  selection  process  is  crucial  to  select  only  the  informative 
features  and  discard  non-informative  ones.  This  can  be  achieved  by  sparse  methods 
which  use  a  penalization  approach,  such  as  LASSO  (least  absolute  shrinkage  and 
selection  operator)  to  set  some  model  coefficients  to  zero  (thereby  discarding  those 
variables).  Sparsity  can  be  seen  as  an  application  of  Occam’s  razor:  prefer  simpler 
models  to  complex  ones. 

Given  the  set  of  samples  (X,  y),  the  objective  of  a  sparse  model  is  to  minimize 
the  SSE  through  a  restriction  (or  penalty): 

1  2 

—  ||Xw-y||2+a||w||i, 

2  n 

where  |  |w|  |i  is  the  LI -norm  of  the  parameter  vector  w  =  (oq,  . . . ,  afi). 

Practical  Case:  Prediction  of  the  Price  of  a  New  Housing  Market 

In  this  practical  case  we  want  to  solve  the  question:  Can  we  predict  the  price  of  a 
new  market  given  any  of  its  attributes? 

We  will  use  the  Boston  housing  dataset  from  Scikit-learn,  which  provides  recorded 
measurements  of  13  attributes  of  housing  markets  around  Boston,  as  well  as  the 
median  house  price.  Once  we  load  the  dataset  (506  instances),  the  description  of 
the  dataset  can  easily  be  shown  by  printing  the  field  DESCR.  The  data  (x),  feature 
names,  and  target  (y)  are  stored  in  other  fields  of  the  dataset. 

We  first  consider  the  task  of  predicting  median  house  values  in  the  Boston  area 
using  as  the  variable  one  of  the  attributes,  for  instance,  LSTAT,  defined  as  the  “pro¬ 
portion  of  lower  status  of  the  population”. 

Seaborn  visualization  can  be  used  to  show  this  linear  relationships  easily: 


3Copy  of  UCI  ML  housing  dataset:  http://archive.ics.uci.edu/ml/datasets/Housing. 
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In  [10] : 


Out [ 10 ] 


In  [11] : 


Fig.  6.5  Scatter  plot  of  Boston  data  (LSTAT  versus  price)  and  their  linear  relationship  (using 
Imp  lot) 


/ - \ 

from  sklearn  import  datasets 
boston  =  datasets . load_boston ( ) 

X_boston,  y_boston  =  boston. data,  boston . target 
print  ' Shape  of  data : ' ,  X_boston . shape ,  y_boston . shape 
print  ' Feature  names : ' , boston . feature_names 
df_boston  =  pd . DataFrame  ( bos  ton . data , 

columns  =  boston. feature_names) 
df_bos ton [' price '  ]  =  bos  ton .  target 

sns . lmplot ( "price " ,  "LSTAT",  df_boston) 

V _ / 


Shape  of  data:  (506L,  13L)  (506L,) 

Feature  names:  ['CRIM'  'ZN'  'INDUS'  ' CHAS '  'NOX'  'RM'  'AGE' 

'DIS'  'RAD'  'TAX'  ' PTRATIO '  'B'  'LSTAT'] 

In  Fig.  6.5,  we  can  clearly  see  that  the  relationship  between  price  and  LSTAT 
is  nonlinear,  since  the  straight  line  is  a  poor  fit.  We  can  examine  whether  a  better  fit 
can  be  obtained  by  including  higher-order  terms.  For  example,  a  quadratic  model: 

y,-  ao  +  aiXj  +  aixj 

The  lmplot  function  allows  to  easily  change  the  order  of  the  model  as  is  done  in 

the  next  code,  which  outputs  Fig.  6.6,  where  we  observe  a  better  fit. 

/  \ 

sns . lmplot ( "price " ,  "LSTAT",  df_boston,  order  =  2) 

\ _ / 

To  study  the  relation  among  multiple  variables  in  a  dataset,  there  are  different 
options.  We  can  study  the  relationship  between  several  variables  in  a  dataset  by 
using  the  functions  corr  and  he  at  map  which  allow  to  calculate  a  correlation 
matrix  for  a  dataset  and  draws  a  heat  map  with  the  correlation  values.  The  heat  map 
is  a  matricial  image  which  helps  to  interpret  the  correlations  among  variables.  For  the 
sake  of  visualization,  we  do  not  consider  all  the  13  variables  in  the  Boston  housing 
data,  but  six:  CRIM,  per  capita  crime  rate  by  town;  INDUS,  proportion  of  non-retail 
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Fig.  6.6  Scatter  plot  of  Boston  data  (LSTAT  versus  price)  and  their  polynomial  relationship 
(using  Imp  lot  with  order  2) 


In  [12] 


business  acres  per  town;  NOX,  nitric  oxide  concentrations  (parts  per  10  million);  RM, 
average  number  of  rooms  per  dwelling;  AGE,  proportion  of  owner-occupied  units 
built  prior  to  1940;  and  LSTAT.  These  variables  are  indicated  by  their  indexes  in  the 
following  code: 

f  \ 

indexes  =  [0,2,4,5,6,12] 

df 2  =  pd. DataFrame (boston. data [: , indexes] , 

columns  =  boston . feature_names [ indexes ] ) 
df 2  ['price']  =  boston,  target 
corrmat  =  df2  .  corr  (  ) 

sns  . heatmap  (  corrmat  ,  vmax  =  .8,  square  =  True) 

\ _ / 


Figure  6.7  shows  a  heat  map  representing  the  correlation  between  pairs  of  vari¬ 
ables;  specifically,  the  six  variables  selected  and  the  price  of  houses.  The  color  bar 
shows  the  range  of  values  used  in  the  matrix.  This  plot  is  a  useful  way  of  summa¬ 
rizing  the  correlation  of  several  variables.  It  can  be  seen  that  LSTAT  and  RM  are  the 
variables  that  are  most  correlated  with  price. 

Another  good  way  to  explore  multiple  variables  is  the  scatter  plot  from  Pandas. 
The  scatter  plot  is  a  grid  of  plots  of  multiple  variables  one  against  the  others,  illus¬ 
trating  the  relationship  of  each  variable  with  the  rest.  For  the  sake  of  visualization, 
we  do  not  consider  all  the  variables,  but  just  three:  RM,  AGE,  and  LSTAT  defined  by 
indexes  in  the  following  code: 

(  \ 

indexes= [5 , 6 , 12] 

df 2  =  pd. DataFrame  (boston. data  [:  ,  indexes]  , 

columns  =  boston . feature_names [ indexes ] ) 
df 2  ['price']  =  boston,  target 

pd  .  s  c  a  1 1  e  r_ma  t  r ix  (  df  2  ,  figsize  =  (12.0,  12.0)) 

\ _ / 


In  [13] : 
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Fig.  6.7  Correlation  plot: 
heat  map  representing  the 
correlation  between  seven 
pairs  of  variables  in  the 
Boston  housing  dataset 


GRIM  INDUS  NOX  RM  AGE  LSTAT  price 


This  code  outputs  Fig.  6.8,  where  we  obtain  visual  information  concerning  the 
density  function  for  every  variable,  in  the  diagonal,  as  well  as  the  scatter  plots  of  the 
data  points  for  pairs  of  variables.  In  the  last  column,  we  can  appreciate  the  relation 
between  the  three  variables  selected  and  house  prices.  It  can  be  seen  that  RM  follows 
a  linear  relation  with  price;  whereas  AGE  does  not.  LSTAT  follows  a  higher-order 
relation  with  price.  This  plot  gives  us  an  indication  of  how  good  or  bad  every 
attribute  would  be  as  a  variable  in  a  linear  model. 

For  the  evaluation  of  the  prediction  power  of  the  model  with  new  samples,  we  split 
the  data  into  a  training  set  and  a  testing  set,  and  we  compute  the  linear  regression 
score,  which  returns  the  coefficient  of  determination  R2  of  the  prediction.  We  can 
also  calculate  the  MSE. 
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Fig.  6.8  Scatter  plot  of  Boston  housing  dataset 


Out [ 14 ]: Training  and  testing  set  sizes  (253,  13)  (253,  13) 

Coeff  and  intercept:  [  1.20133313  0.02449686  0.00999508 
0.42548672  -8.44272332  8.87767164  -0.04850422  -1.11980855 
0.20377571  -0.01597724  -0.65974775  0.01777057  -0.11480104] 

-10 . 0174305829 

Testing  Score:  -2.24420202674 
Training  MSE :  9.98751732546 
Testing  MSE:  302.64091133 

We  can  see  that  all  the  coefficients  obtained  are  different  from  zero,  meaning  that 
no  variable  is  discarded.  Next,  we  try  to  build  a  sparse  model  to  predict  the  price 
using  the  most  important  factors  and  discarding  the  non-informative  ones.  To  do  this, 
we  can  create  a  LASSO  regressor,  forcing  zero  coefficients. 


6.2  Linear  Regression 


109 


In  [15] : 


regr_lasso  =  linear_model . Lasso ( alpha  =  .3) 

regr_lasso .fit (X_train ,  y_train)  print  ' Coef f  and  intercept 
' , regr_lasso . coef_ 

print  ' Tesing  Score:',  regr_lasso. score ( X_tes t , 
y_test)  print  'Training  MSE :  ' , 

np .mean  (  ( regr_lasso . predict  (X_train)  -  y_t r ai n )  *  *  2 ) 

print  'Testing  MSE:  ' , 

np.mean ( ( regr_lasso . predict (X_test)  -  y_test ) **2 ) 


Out [ 15 ] : Coef f  and  intercept:  [  0.  0.01996512  -0.  0.  -0.  7.69894744 
-0.03444803  -0.79380636  0.0735163  -0.0143421  -0.66768539 
0 . 0154743  7  -0.22181817]  -6.18324183  615 
Testing  Score:  0.501127529021 
Training  MSE:  10.7343110095 
Testing  MSE:  46.5381680949 

It  can  now  be  seen  that  the  result  of  the  model  fitting  for  a  set  of  sparse  coefficients 
is  much  better  than  before  (using  all  the  variables),  with  the  score  increasing  from 
—2.24  to  0.5.  This  demonstrates  that  four  of  the  initial  variables  are  not  important 
for  the  prediction  and  in  fact  they  confuse  the  regressor. 

With  the  LASSO  result,  we  can  also  emphasize  the  most  important  factors  for 
determining  the  price  of  a  new  market,  based  on  the  coefficient  values: 


\ 

ind  = 

np . argsort ( np . abs (regr_lasso . coef_) ) 

print 

'Ordered  variable  (from  less  to  more  important) : ' , 

boston. f ea tur e_names [ind] 

_ y 

Out [16] : Ordered  variable  (from  less  to  more  important) :  ['CRIM'  'INDUS' 
'CHAS'  'NOX'  'TAX'  'B'  'ZN'  'AGE'  'RAD'  ' LSTAT '  ' PTRATIO '  'DIS' 

'RM'  ] 


In  [17] 


There  are  also  other  strategies  for  feature  selection.  For  instance,  we  can  select 
the  k  =  5  best  features,  according  to  the  k  highest  scores,  using  the  function 
Select  KB  est  from  Scikit-learn: 


/ 

\ 

import  s kl ear n . f ea t u r e_s e 1 e c t i on 

as  f  s 

selector  =  fs . SelectKBest ( score_ 

func  =  f s . f _r egr es s i on , 

k  =  5 ) 

selector  .  fit_transform ( X_t r a in  , 

y_t r a in )  per 

selector  .  fit  ( X_t r a in  , y_t r a i n ) 

print  'Selected  features:', 

zip ( selector . get_support ( ) 

,  boston . feature_names ) 

V 

y 

Out [ 17 ]:  Selected  features:  [(False,  'CRIM'),  (False,  'ZN'),  (True, 

'INDUS'),  (False,  'CHAS'),  (False,  'NOX'),  (True,  'RM'),  (True, 
'AGE'),  (False,  'DIS'),  (False,  'RAD'),  (False,  'TAX'),  (True, 
'PTRATIO'),  (False,  ' B'),  (True,  'LSTAT')] 

The  set  of  selected  features  is  now  different,  since  the  criterion  has  changed. 
However,  three  of  the  most  important  features:  RM,  PTRATIO,  and  LSTAT. 

In  order  to  evaluate  the  prediction,  it  could  be  interesting  to  visualize  the  target 
and  predicted  responses  in  a  scatter  plot,  as  it  is  done  in  the  next  code: 
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Fig.  6.9  Relation  between  true  (x-axis)  and  predicted  (y-axis)  prices 


The  output  is  shown  in  Fig.  6.9,  where  we  can  observe  that  the  original  prices 
are  properly  estimated  by  the  predicted  ones,  except  for  the  higher  values,  around 
$50,000  (points  in  the  top  right  corner). 

Finally,  it  is  worth  noting  that  we  can  work  with  statistical  evaluation  of  a  linear 
regression  with  the  OLS  toolbox  of  the  Stats  Model  toolbox.  This  toolbox  is  useful 
to  study  several  statistics  concerning  the  regression  model.  To  know  more  about  the 
toolbox,  go  to  the  Documentation  related  to  Stats  Models. 


6.3  Logistic  Regression 


Logistic  regression  is  a  type  of  model  of  probabilistic  statistical  classification.  It  is 
used  as  a  binary  model  to  predict  a  binary  response,  the  outcome  of  a  categorical 
dependent  variable  (i.e.,  a  class  label),  based  on  one  or  more  variables. 

The  form  of  the  logistic  function  is: 


fix)  = 


1 

1  +  e~Xx 


4http://  statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html. 


6.3  Logistic  Regression 
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Fig.  6.1 0  Logistic  function  for  different  lambda  values 

2.0 


-1.0 

-3-2-10  1  2  3 

x 

Fig.  6.1 1  Linear  regression  (blue)  versus  logistic  regression  (red)  for  fitting  a  set  of  data  (black 
points)  normally  distributed  across  the  0  and  1  y-values 


Figure 6. 10  illustrates  the  logistic  function  with  different  values  of  A.  This  function 
is  useful  because  it  can  take  as  its  input  any  value  from  negative  infinity  to  positive 
infinity,  whereas  the  output  is  restricted  to  values  between  0  and  1  and  hence  can  be 
interpreted  as  a  probability. 

The  set  of  samples  (X,  y),  illustrated  as  black  points  in  Fig.  6.11,  defines  a  fitting 
problem  suitable  for  a  logistic  regression.  The  blue  and  red  lines  show  the  fitting 
result  for  linear  and  logistic  models,  respectively.  In  this  case,  a  logistic  model  can 
clearly  explain  the  data;  whereas  a  linear  model  cannot. 

Practical  Case:  Winning  or  Losing  Football  Team 

Now,  we  pose  the  question:  What  number  of  goals  makes  a  football  team  the  winner 
or  the  loser?  More  concretely,  we  want  to  predict  victory  or  defeat  in  a  football 
match  when  we  are  given  the  number  of  goals  a  team  scores.  To  do  this  we  consider 
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In  [19] 


the  set  of  results  of  the  football  matches  from  the  Spanish  league  and  we  build  a 
classification  model  with  it. 

We  first  read  the  data  file  in  a  DataFrame  and  select  the  following  columns 
in  a  new  DataFrame:  HomeTeam,  AwayTeam,  FTHG  (home  team  goals),  FTAG 
(away  team  goals),  and  FTR  (H=home  win,  D  =  draw,  A  =  away  win).  We  then  build 
a  d-dimensional  vector  of  variables  with  all  the  scores,  x,  and  a  binary  response 
indicating  victory  or  defeat,  y.  For  that,  we  create  two  extra  columns  containing  W 
the  number  of  goals  of  the  winning  team  and  L  the  number  of  goals  of  the  losing 
team  and  we  concatenate  these  data.  Finally,  we  can  compute  and  visualize  a  logistic 
regression  model  to  predict  the  discrete  value  (victory  or  defeat)  using  these  data. 

r  \ 

from  ski earn . 1 inear_mode 1  import  LogisticRegression 
data  =  pd .  r ead_c s v  (  '  f i 1 e s / chO 6 / S P 1  .  c s v  '  ) 

s  =  data [[' HomeTeam AwayTeam ' ,  'FTHG',  'FTAG',  'FTR']] 
de  f  my_f 1  ( row )  : 

return  max ( row  [  ' FTHG '  ]  ,  row [  ' FTAG '  ]  ) 
de  f  my_f  2  ( row )  : 

return  min ( row  [  ' FTHG '  ]  ,  row [  ' FTAG '  ]  ) 
s['W']  =  s . apply (my_f 1 ,  axis  =  1) 
s [ ' L ' ]  =  s . apply ( my_f 2 ,  axis  =  1 ) 
xl  =  s [ ' W ' ] . values 

yl  =  np  .  ones  (  len  (xl  )  ,  dtype  =  np  .  int  ) 
x2  =  s [ ' L ' ] . values 

y2  =  np  .  zeros  (  len  (  x2  )  ,  dtype  =  np  .  int  ) 
x  =  np . concatenate ([ xl ,  x2 ] ) 
x  =  x [ : ,  np . newaxis ] 
y  =  np  .  concatenate  ([ yl  ,  y2  ]  ) 
logreg  =  L og i s t i c Re gr e s s i on ( ) 

1 ogr eg . f i t ( x ,  y ) 

X_test  =  np  .  1 inspace  ( -5 ,  10,  3  00) 

def  lr_model (x) : 

return  1  /  (1  +  np  .  exp  (  - x  )  ) 

loss  =  lr_model ( X_test * logreg . coef_  +  logreg . intercept_ ) 

. rave  1  (  ) 

X_t e s  1 2  =  X_t e s  t  [  :  , np . newaxis ] 

losspred  =  logreg . predict (X_test2 ) 
pit. scatter (x. ravel () ,  y, 

color  =  ' black ' , 

s  =  100,  zorder  =  20, 

alpha  =  0.03) 

pit . plot  (X_test  ,  loss,  color  =  'blue',  linewidth  =  3) 

pit . plot  (X_test  ,  losspred,  color  =  'red',  linewidth  =  3) 

\ _ / 


Figure  6. 12  shows  a  scatter  plot  with  transparency  so  we  can  appreciate  the  over¬ 
lapping  in  the  discrete  positions  of  the  total  numbers  of  victories  and  defeats.  It 
also  shows  the  fitting  of  the  logistic  regression  model,  in  blue,  and  prediction  of  the 
logistic  regression  model,  in  red,  for  the  Spanish  football  league  results.  With  this 
information  we  can  estimate  that  the  cutoff  value  is  1 .  This  means  that  a  team,  in 
general,  has  to  score  more  than  one  goal  to  win. 


5  http://www.football-  data.co.uk/mmz428 1/121 3/SP 1  .csv. 
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Number  of  goals 


Fig.  6.1 2  Fitting  of  the  logistic  regression  model  {blue)  and  prediction  of  the  logistic  regression 
model  (red)  for  the  Spanish  football  league  results 


6.4  Conclusions 

In  this  chapter,  we  have  focused  on  regression  analysis  and  the  different  Python  tools 
that  are  useful  for  performing  it.  We  have  shown  how  regression  analysis  allows  us 
to  better  understand  data  by  means  of  building  a  model  from  it.  We  have  formally 
presented  four  different  regression  models:  simple  linear  regression,  multiple  linear 
regression,  polynomial  regression,  and  logistic  regression.  We  have  also  emphasized 
the  properties  of  sparse  models  in  the  selection  of  variables. 

The  different  models  have  been  used  in  three  real  problems  dealing  with  different 
types  of  datasets.  In  these  practical  cases,  we  solve  different  questions  regarding 
the  behavior  of  the  data,  the  prediction  of  data  values  (continuous  or  discrete),  and 
the  importance  of  variables  for  the  model.  In  the  first  case,  we  showed  that  there 
is  a  decreasing  tendency  in  the  sea  ice  extent  over  the  years,  and  we  also  predicted 
the  amount  of  ice  for  the  next  20  years.  In  the  second  case,  we  predicted  the  price 
of  a  market  given  a  set  of  attributes  and  distinguished  which  of  the  attributes  were 
more  important  in  the  prediction.  Moreover,  we  presented  a  useful  way  to  show 
the  correlation  between  pairs  of  variables,  as  well  as  a  way  to  plot  the  relationship 
between  pairs  of  variables.  In  the  third  case,  we  faced  the  problem  of  predicting 
victory  or  defeat  in  a  football  match  given  the  score  of  a  team.  We  posed  this  problem 
as  a  classification  problem  and  solved  it  using  a  logistic  regression  model;  and  we 
estimated  the  minimum  number  of  goals  a  team  has  to  score  to  win. 

Acknowledgements  This  chapter  was  co-written  by  Laura  Igual  and  Jordi  Vitria. 
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Unsupervised  Learning 


7.1  Introduction 

In  machine  learning,  the  problem  of  unsupervised  learning  is  that  of  trying  to  find 
hidden  structure  in  unlabeled  data.  Since  the  examples  given  to  the  learner  are  unla¬ 
beled,  there  is  no  error  or  reward  signal  to  evaluate  the  goodness  of  a  potential 
solution.  This  distinguishes  unsupervised  from  supervised  learning.  Unsupervised 
learning  is  defined  as  the  task  performed  by  algorithms  that  learn  from  a  training  set 
of  unlabeled  or  unannotated  examples,  using  the  features  of  the  inputs  to  categorize 
them  according  to  some  geometric  or  statistical  criteria. 

Unsupervised  learning  encompasses  many  techniques  that  seek  to  summarize  and 
explain  key  features  or  structures  of  the  data.  Many  methods  employed  in  unsuper¬ 
vised  learning  are  based  on  data  mining  methods  used  to  preprocess  data.  Most 
unsupervised  learning  techniques  can  be  summarized  as  those  that  tackle  the  follow¬ 
ing  four  groups  of  problems: 

•  Clustering :  has  as  a  goal  to  partition  the  set  of  examples  into  groups. 

•  Dimensionality  reduction :  aims  to  reduce  the  dimensionality  of  the  data.  Here,  we 
encounter  techniques  such  as  Principal  Component  Analysis  (PC A),  independent 
component  analysis,  and  nonnegative  matrix  factorization. 

•  Outlier  detection :  has  as  a  purpose  to  find  unusual  events  (e.g.,  a  malfunction), 
that  distinguish  part  of  the  data  from  the  rest  according  to  certain  criteria. 

•  Novelty  detection :  deals  with  cases  when  changes  occur  in  the  data  (e.g.,  in  stream¬ 
ing  data). 

The  most  common  unsupervised  task  is  clustering,  which  we  focus  on  in  this 
chapter. 
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7.2  Clustering 

Clustering  is  a  process  of  grouping  similar  objects  together;  i.e.,  to  partition  unlabeled 
examples  into  disjoint  subsets  of  clusters,  such  that: 

•  Examples  within  a  cluster  are  similar  (in  this  case,  we  speak  of  high  intraclass 
similarity). 

•  Examples  in  different  clusters  are  different  (in  this  case,  we  speak  of  low  interclass 
similarity). 

When  we  denote  data  as  similar  and  dissimilar,  we  should  define  a  measure  for  this 
similarity/dissimilarity.  Note  that  grouping  similar  data  together  can  help  in  discov¬ 
ering  new  categories  in  an  unsupervised  manner,  even  when  no  sample  category 
labels  are  provided.  Moreover,  two  kinds  of  inputs  can  be  used  for  grouping: 

(a)  in  similarity-based  clustering ,  the  input  to  the  algorithm  is  an  n  x  n  dissimilarity 
matrix  or  distance  matrix ; 

(b)  in  feature -based  clustering ,  the  input  to  the  algorithm  is  an  n  x  D  feature  matrix 
or  design  matrix ,  where  n  is  the  number  of  examples  in  the  dataset  and  D  the 
dimensionality  of  each  sample. 

Similarity-based  clustering  allows  easy  inclusion  of  domain-specific  similarity, 
while  feature-based  clustering  has  the  advantage  that  it  is  applicable  to  potentially 
noisy  data. 

Therefore,  several  questions  regarding  the  clustering  process  arise. 

•  What  is  a  natural  grouping  among  the  objects?  We  need  to  define  the  “groupness” 
and  the  “similarity /distance”  between  data. 

•  How  can  we  group  samples?  What  are  the  best  procedures?  Are  they  efficient? 
Are  they  fast?  Are  they  deterministic? 

•  How  many  clusters  should  we  look  for  in  the  data?  Shall  we  state  this  number 
a  priori?  Should  the  process  be  completely  data  driven  or  can  the  user  guide  the 
grouping  process?  How  can  we  avoid  “trivial”  clusters?  Should  we  allow  final 
clustering  results  to  have  very  large  or  very  small  clusters?  Which  methods  work 
when  the  number  of  samples  is  large?  Which  methods  work  when  the  number  of 
classes  is  large? 

•  What  constitutes  a  good  grouping?  What  objective  measures  can  be  defined  to 
evaluate  the  quality  of  the  clusters? 

There  is  not  always  a  single  or  optimal  answer  to  these  questions.  It  used  to  be  said 
that  clustering  is  a  “subjective”  issue.  Clustering  will  help  us  to  describe,  analyze, 
and  gain  insight  into  the  data,  but  the  quality  of  the  partition  depends  to  a  great  extent 
on  the  application  and  the  analyst. 
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7.2  Clustering 

7.2.1  Similarity  and  Distances 

To  speak  of  similar  and  dissimilar  data,  we  need  to  introduce  a  notion  of  the  similarity 
of  data.  There  are  several  ways  for  modeling  of  similarity.  A  simple  way  to  model 
this  is  by  means  of  a  Gaussian  kernel: 

s(a,  b )  =  e-^d(a’b) 

where  d(a,  b)  is  a  metric  function,  and  7  is  a  constant  that  controls  the  decay  of  the 
function.  Observe  that  when  a  =  b,  the  similarity  is  maximum  and  equal  to  one.  On 
the  contrary,  when  a  is  very  different  to  b ,  the  similarity  tends  to  zero.  The  former 
modeling  of  the  similarity  function  suggests  that  we  can  use  the  notion  of  distance 
as  a  surrogate.  The  most  widespread  distance  metric  is  the  Minkowski  distance : 

d 

d(a,  b )  =  | at  -  bi\p)xlp 

/—I 

where  d(a,b )  stands  for  the  distance  between  two  elements  a ,  b  e  Rd ,  d  is  the 
dimensionality  of  the  data,  and  p  is  a  parameter. 

The  best-known  instantiations  of  this  metric  are  as  follows: 

•  when  p  =  2,  we  have  the  Euclidean  distance , 

•  when  p  =  1,  we  have  the  Manhattan  distance ,  and 

•  when  p  —  inf,  we  have  the  max-distance.  In  this  case,  the  distance  corresponds  to 
the  component  | at  —  bi\  with  the  highest  value. 


7.2.2  What  Constitutes  a  Good  Clustering?  Defining  Metrics 
to  Measure  Clustering  Quality 

When  performing  clustering,  the  question  normally  arises:  How  do  we  measure  the 
quality  of  the  clustering  result?  Note  that  in  unsupervised  clustering,  we  do  not  have 
groundtruth  labels  that  would  allow  us  to  compute  the  accuracy  of  the  algorithm.  Still, 
there  are  several  procedures  for  assessing  quality.  We  find  two  families  of  techniques: 
those  that  allow  us  to  compare  clustering  techniques,  and  those  that  check  on  specific 
properties  of  the  clustering,  for  example  “compactness”. 


7.2.2. 1  Rand  Index,  Homogeneity,  Completeness  and  V-measure 
Scores 

One  of  the  best-known  methods  for  comparing  the  results  in  clustering  techniques 
in  statistics  is  the  Rand  index  or  Rand  measure  (named  after  William  M.  Rand).  The 
Rand  index  evaluates  the  similarity  between  two  results  of  data  clustering.  Since 
in  unsupervised  clustering,  class  labels  are  not  known,  we  use  the  Rand  index  to 
compare  the  coincidence  of  different  clusterings  obtained  by  different  approaches 
or  criteria.  As  an  alternative,  we  later  discuss  the  Silhouette  coefficient :  instead  of 
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In  [1]  : 


Out  [  1  ]  : 


comparing  different  clusterings,  this  evaluates  the  compactness  of  the  results  of 
applying  a  specific  clustering  approach. 

Given  a  set  of  n  elements  S  =  {o\ , . . . ,  on),  we  can  compare  two  partitions  of  S  : 
X  =  {X\ ,  ... ,  Xr },  a  partition  of  S  into  r  subsets;  and  Y  =  [Y\ ,  . . . ,  ,  FJ,  a  partition 
of  S  into  s  subsets.  Let  us  use  the  annotations  as  follows: 

•  a  is  the  number  of  pairs  of  elements  in  S  that  are  in  the  same  subset  in  both  X  and 

y ; 

•  b  is  the  number  of  pairs  of  elements  in  S  that  are  in  different  subsets  in  both  X  and 

y ; 

•  c  is  the  number  of  pairs  of  elements  in  S  that  are  in  the  same  subset  in  X ,  but  in 
different  subsets  in  Y ;  and 

•  d  is  the  number  of  pairs  of  elements  in  S  that  are  in  different  subsets  in  X ,  but  in 
the  same  subset  in  Y . 

The  Rand  index,  R ,  is  defined  as  follows: 

a  +  b 

R=  - , 

a  T  b  T  c  T  d 

ensuring  that  its  value  is  between  0  and  1 . 

One  of  the  problems  of  the  Rand  index  is  that  when  given  two  datasets  with  random 
labelings,  it  does  not  take  a  constant  value  (e.g.,  zero)  as  expected.  Moreover,  when 
the  number  of  clusters  increases  it  is  desirable  that  the  upper  limit  tends  to  the  unity. 
To  solve  this  problem,  a  form  of  the  Rand  index,  called  the  Adjusted  Rand  index ,  is 
used  that  adjusts  the  Rand  index  with  respect  to  chance  grouping  of  elements.  It  is 
defined  as  follows: 

(2)^  +  d)  —  [( a  +  b)(a  +  c)  +  (c  +  d)(b  +  d)] 

AR  =  - -  . 

(2)  [(^  "T  b)(a  +  c)  +  (c  +  d)(b  +  d)] 

Another  way  for  comparing  clustering  results  is  the  V-measure.  Let  us  first  intro¬ 
duce  some  concepts.  We  say  that  a  clustering  result  satisfies  a  homogeneity  criterion 
if  all  of  its  clusters  contain  only  data  points  which  are  members  of  the  same  original 
(single)  class.  A  clustering  result  satisfies  a  completeness  criterion  if  all  the  data 
points  that  are  members  of  a  given  class  are  elements  of  the  same  predicted  cluster. 
Note  that  both  scores  have  real  positive  values  between  0.0  and  1.0,  larger  values 
being  desirable.  For  example,  if  we  consider  two  toy  clustering  sets  (e.g.,  original 
and  predicted)  with  four  samples  and  two  labels,  we  get: 


r 

print  (  " % . 3  f "  %  metrics  . homogeneity_score  (  [0 , 

0  , 

1 , 

1]  , 

N 

[0  , 

0 , 

0 , 

0]  )  ) 

\ 

_ y 

0 .000 


1  https  ://en.  wikipedia.org/wiki/Rand_index. 
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In  [2]  : 


Out  [  2  ]  : 


In  [3]  : 


Out  [  3  ]  : 


In  [4]  : 


Out  [  4  ]  : 


The  homogeneity  is  0  since  the  samples  in  the  predicted  cluster  0  come  from 
original  cluster  0  and  cluster  1 . 


print  metrics,  c  omp leteness_score  (  [0 , 

0  , 

1  , 

1]  , 

\ 

[1  , 

1  , 

0 , 

0]  ) 

\ _ 

_ y 

1.0 

The  completeness  is  1  since  all  the  samples  from  the  original  cluster  with  label  0 
go  into  the  same  predicted  cluster  with  label  1 ,  and  all  the  samples  from  the  original 
cluster  with  label  1  go  into  the  same  predicted  cluster  with  label  0. 

However,  how  can  we  define  a  measure  that  takes  into  account  the  completeness 
as  well  as  the  homogeneity?  The  V-measure  is  the  harmonic  mean  between  the 
homogeneity  and  the  completeness  defined  as  follows: 

v  =  2  *  (homogeneity  *  completeness) /(homogeneity  +  completeness). 

Note  that  this  metric  is  not  dependent  of  the  absolute  values  of  the  labels:  a 
permutation  of  the  class  or  cluster  label  values  will  not  change  the  score  value  in 
any  way.  Moreover,  the  metric  is  symmetric  with  respect  to  switching  between  the 
predicted  and  the  original  cluster  label.  This  is  very  useful  to  measure  the  agreement 
of  two  independent  label  assignment  strategies  applied  to  the  same  dataset  even 
when  the  real  groundtruth  is  not  known.  If  class  members  are  completely  split  across 
different  clusters,  the  assignment  is  totally  incomplete,  hence  the  V-measure  is  null: 


f 

print  (  " %  .  3  f  "  %  metrics  . v_measure_score  ([ 0 , 

0  , 

0 , 

0]  , 

\ 

[0  , 

1  , 

2  , 

3  ]  )  ) 

K 

_ y 

0 .000 


In  contrast,  clusters  that  include  samples  from  different  classes  destroy  the  homo¬ 
geneity  of  the  labeling,  hence: 


r 

print  (  " %  .  3  f  "  %  metrics  . v_measure_score  ([ 0 , 

0  , 

1 , 

1]  , 

N 

[0  , 

0 , 

0 , 

0]  )  ) 

K 

_ y 

0 .000 

In  summary,  we  can  say  that  the  advantages  of  the  V-measure  include  that  it 
has  bounded  scores:  0.0  means  the  clustering  is  extremely  bad;  1.0  indicates  a  per¬ 
fect  clustering  result.  Moreover,  it  can  be  interpreted  easily:  when  analyzing  the 
V-measure,  low  completeness  or  homogeneity  explain  in  which  direction  the  clus¬ 
tering  is  not  performing  well.  Furthermore,  we  do  not  assume  anything  about  the 
cluster  structure.  Therefore,  it  can  be  used  to  compare  clustering  algorithms  such 
as  K-means ,  which  assume  isotropic  blob  shapes,  with  results  of  other  clustering 
algorithms  such  as  spectral  clustering  (see  Sect.  7. 2. 3. 2),  which  can  find  clusters 
with  “folded”  shapes.  As  a  drawback,  the  previously  introduced  metrics  are  not 
normalized  with  regard  to  random  labeling.  This  means  that  depending  on  the  num¬ 
ber  of  samples,  clusters  and  groundtruth  classes,  a  completely  random  labeling  will 
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not  always  yield  the  same  values  for  homogeneity,  completeness  and  hence,  the  V- 
measure.  In  particular,  random  labeling  will  not  yield  a  zero  score,  and  they  will  tend 
further  from  zero  as  the  number  of  clusters  increases.  It  can  be  shown  that  this  prob¬ 
lem  can  reliably  be  overcome  when  the  number  of  samples  is  high,  i.e.,  more  than  a 
thousand,  and  the  number  of  clusters  is  less  than  10.  These  metrics  require  knowl¬ 
edge  of  the  groundtruth  classes,  while  in  practice  this  information  is  almost  never 
available  or  requires  manual  assignment  by  human  annotators.  Instead,  as  mentioned 
before,  these  metrics  can  be  used  to  compare  the  results  of  different  clusterings. 


7. 2.2.2  Silhouette  Score 

An  alternative  to  the  former  scores  is  to  evaluate  the  final  ‘shape’  of  the  clustering 
result.  This  is  the  underlying  idea  behind  the  Silhouette  coefficient.  It  is  defined  as 
a  function  of  the  intracluster  distance  of  a  sample  in  the  dataset,  a  and  the  nearest- 
cluster  distance,  b  for  each  sample.  Later,  we  will  discuss  different  ways  to  compute 
the  distance  between  clusters.  The  Silhouette  coefficient  for  a  sample  i  can  be  written 
as  follows: 

b  —  a 

Silhouette  (/)  =  - . 

max  (a,  b) 

Hence,  if  the  Silhouette  s(i)  is  close  to  0,  it  means  that  the  sample  is  on  the  border  of 
its  cluster  and  the  closest  one  from  the  rest  of  the  dataset  clusters.  A  negative  value 
means  that  the  sample  is  closer  to  the  neighbor  cluster.  The  average  of  the  Silhouette 
coefficients  of  all  samples  of  a  given  cluster  defines  the  “goodness”  of  the  cluster. 
A  high  positive  value,  i.e.,  close  to  1  would  mean  a  compact  cluster,  and  vice  versa. 
And  the  average  of  the  Silhouette  coefficients  of  all  clusters  gives  idea  of  the  quality 
of  the  clustering  result.  Note  that  the  Silhouette  coefficient  only  makes  sense  when 
the  number  of  labels  predicted  is  less  than  the  number  of  samples  clustered. 

The  advantage  of  the  Silhouette  coefficient  is  that  it  is  bounded  between  —  1  and 
+  1.  Moreover,  it  is  easy  to  show  that  the  score  is  higher  when  clusters  are  dense 
and  well  separated;  a  logical  feature  when  speaking  about  clusters.  Furthermore,  the 
Silhouette  coefficient  is  generally  higher  when  clusters  are  compact. 


7.2.3  Taxonomies  of  Clustering  Techniques 

Within  different  clustering  algorithms,  one  can  find  soft  partition  algorithms,  which 
assign  a  probability  of  the  data  belonging  to  each  cluster,  and  also  hard  partition 
algorithms,  where  each  datapoint  is  assigned  precise  membership  of  one  cluster. 
A  typical  example  of  a  soft  partition  algorithm  is  the  Mixture  of  Gaussians  [1], 
which  can  be  viewed  as  a  density  estimator  method  that  assigns  a  confidence  or 


2  The  intracluster  distance  of  sample  i  is  obtained  by  the  distance  of  the  sample  to  the  nearest  sample 
from  the  same  class,  and  the  nearest-cluster  distance  is  given  by  the  distance  to  the  closest  sample 
from  the  cluster  nearest  to  the  cluster  of  sample  i. 
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probability  to  each  point  in  the  space.  A  Gaussian  mixture  model  is  a  probabilistic 
model  that  assumes  all  the  data  points  are  generated  from  a  mixture  of  a  finite 
number  of  Gaussian  distributions  with  unknown  parameters.  The  universally  used 
generative  unsupervised  clustering  using  a  Gaussian  mixture  model  is  also  known 
as  EM  Clustering.  Each  point  in  the  dataset  has  a  soft  assignment  to  the  K  clusters. 
One  can  convert  this  soft  probabilistic  assignment  into  membership  by  picking  out 
the  most  likely  clusters  (those  with  the  highest  probability  of  assignment). 

An  alternative  to  soft  algorithms  are  the  hard  partition  algorithms,  which  assign  a 
unique  cluster  value  to  each  element  in  the  feature  space.  According  to  the  grouping 
process  of  the  hard  partition  algorithm,  there  are  two  large  families  of  clustering 
techniques: 

•  Partitional  algorithms',  these  start  with  a  random  partition  and  refine  it  iteratively. 
That  is  why  sometimes  these  algorithms  are  called  “flat”  clustering.  In  this  chapter, 
we  will  consider  two  partitional  algorithms  in  detail:  K-means  and  spectral  clus¬ 
tering. 

•  Hierarchical  algorithms :  these  organize  the  data  into  hierarchical  structures,  where 
data  can  be  agglomerated  in  the  bottom-up  direction,  or  split  in  a  top-down  manner. 
In  this  chapter,  we  will  discuss  and  illustrate  agglomerative  clustering. 

A  typical  hard  partition  algorithm  is  K-means  clustering.  We  will  now  discuss  it 
in  some  detail. 


7.2.3. 1  K-means  Clustering 

K-means  algorithm  is  a  hard  partition  algorithm  with  the  goal  of  assigning  each  data 
point  to  a  single  cluster.  K-means  algorithm  divides  a  set  of  n  samples  X  into  k 
disjoint  clusters  c/,  i  =  1  each  described  by  the  mean  pi  of  the  samples  in  the 

cluster.  The  means  are  commonly  called  cluster  centroids.  The  K-means  algorithm 
assumes  that  all  k  groups  have  equal  variance. 

K-means  clustering  solves  the  following  minimization  problem: 

k  k 

arg minc  ^  ^ d(x,  /i7)  =  arg minc  ^  ^  II*  -  M/II2  (7- 1) 

j=  l  x€cj  j=  l  x€cj 

where  c\  is  the  set  of  points  that  belong  to  cluster  i  and  pi  is  the  center  of  the  class 
q.  K-means  clustering  objective  function  uses  the  square  of  the  Euclidean  distance 
d(x,  pj)  =  \\x  —  pj\\2,  that  is  also  referred  to  as  the  inertia  or  within- cluster  sum- 
of-squares.  This  problem  is  not  trivial  to  solve  (in  fact,  it  is  NP-hard  problem),  so 
the  algorithm  only  hopes  to  find  the  global  minimum,  but  may  become  stuck  at  a 
different  solution. 

In  other  words,  we  may  wonder  whether  the  centroids  should  belong  to  the  original 
set  of  points: 

n 

inertia  =  E  min^edWxi  -  M/||2))-  (7.2) 
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In  [  5 


The  K-means  algorithm ,  also  known  as  Lloyd’s  algorithm,  is  an  iterative  procedure 
that  searches  for  a  solution  of  the  K-means  clustering  problem  and  works  as  follows. 
First,  we  need  to  decide  the  number  of  clusters,  k.  Then  we  apply  the  following 
procedure: 

1.  Initialize  (e.g.,  randomly)  the  k  cluster  centers,  called  centroids. 

2.  Decide  the  class  memberships  of  the  n  data  samples  by  assigning  them  to  the 
nearest-cluster  centroids  (e.g.,  the  center  of  gravity  or  mean). 

3.  Re-estimate  the  k  cluster  centers,  q,  by  assuming  the  memberships  found  above 
are  correct. 

4.  If  none  of  the  n  objects  changes  its  membership  from  the  last  iteration,  then  exit. 
Otherwise  go  to  step  2. 

Let  us  illustrate  the  algorithm  in  Python.  First,  we  will  create  three  sample  distri¬ 
butions: 

r - \ 

MAXN  =  40 

X  =  np .  concatenate  (  [ 

1.25*  np .  random .  randn ( MAXN  ,  2)  , 

5  +  1.5*  np  .  random  .  randn  ( MAXN  ,  2  )  ]  ) 

X  =  np . concatenate ( [ 

X,  [8,  3]  +  1.2*  np .  random .  randn  ( MAXN ,  2 )  ]  ) 

\ _ / 

The  sample  distributions  generated  are  shown  in  Fig.  7.1  (left).  However,  the  algo¬ 
rithm  is  not  aware  of  their  distribution.  Figure 7.1  (right)  shows  what  the  algorithm 
sees.  Let  us  assume  that  we  expect  to  have  three  clusters  (k  =  3)  and  apply  the 
K-means  command  from  the  Scikit-learn  library: 
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Data  as  the  algorithm  sees  them 
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Fig.  7.1  Initial  samples  as  generated  (left),  and  samples  seen  by  the  algorithm  (right) 
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In  [  6  ]  : 


Out  [  6  ]  : 


In  [7]  : 


r 

\ 

f  r  om 

sklearn  import  cluster 

K  = 

3  #  Assuming  we  have  3  clusters! 

elf 

=  c lus t e r . KMeans ( i n i t  =  'random',  n_clusters  =  K) 

elf  . 

fit ( X ) 

v 

_ 

KMeans ( copy_x=True ,  init= 7  random ' ,  max_iter=3 0 0 , 
n_clusters=3 ,  n_init=10/  n_jobs=l,  precompute_distances=True , 
random_state=None ,  tol=0.0001/  verbose=0) 

Each  clustering  algorithm  in  Scikit-learn  is  used  as  follows.  First,  an  object  from 
the  clustering  technique  is  instantiated.  Then  we  can  use  the  fit  method  to  adjust 
the  learning  parameters.  We  also  find  the  method  predict  that,  given  new  data, 
returns  the  cluster  they  belong  to.  For  the  class,  the  labels  over  the  training  data  can 
be  found  in  the  labels_  attribute  or  alternatively  they  can  be  obtained  using  the 
predict  method. 

How  many  “mis-clusterings”  do  we  have?  In  order  to  see  this,  we  tessellate  the 
space  and  color  all  grid  points  from  the  same  cluster  with  the  same  color.  Then,  we 
overlay  the  initial  sample  distributions  (see  Fig.  7.2).  In  the  ideal  case,  we  expect  that 
in  each  partitioned  subspace  the  sample  points  are  of  the  same  color.  However,  as 
shown  in  Fig.  7.2,  the  resulting  clustering,  which  is  represented  in  the  figure  by  the 
color  subspace  in  gray,  does  not  usually  coincide  exactly  with  the  initial  distribution, 
which  is  represented  by  the  color  of  the  data.  For  example,  in  the  same  figure,  if  most 
of  the  blue  points  belong  to  the  same  cluster,  there  are  a  few  ones  that  belong  to  the 
space  occupied  by  the  green  data. 

When  computing  the  Rand  index,  we  get: 

C  \ 

print  ('The  Adjusted  Rand  index  is:  % . 2  f '  % 

metrics  .  adjus  t  ed_r and_s  c  ore  (y. ravel  ()  ,  elf.  labels_)  ) 

V _ > 


Fig.  7.2  Original  samples 
(dots)  generated  by  three 
distributions  and  the  partition 
of  the  space  according  to  the 
K-means  clustering 
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Final  result  of  K-means 
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Out [7]:  The  Adjusted  Rand  index  is:  0.66 

Taking  into  account  that  the  Adjusted  Rand  index  belongs  to  the  interval  [0,  1], 
the  result  of  0.66  in  our  example  means  that  although  most  of  the  clusters  were 
discovered,  not  100%  of  them  were;  as  confirmed  by  Fig.  7.2. 

The  inertia  can  be  seen  as  a  measure  of  how  internally  coherent  the  clusters  are. 
Several  issues  should  be  taken  into  account: 

•  The  inertia  assumes  that  clusters  are  isotropic  and  convex,  since  the  Euclidean 
distance  is  applied,  which  is  isotropic  with  regard  to  the  different  dimensions  of 
the  data.  However,  we  cannot  expect  that  the  data  fulfill  this  assumption  by  default. 
Hence,  the  K-means  algorithm  responds  poorly  to  elongated  clusters  or  manifolds 
with  irregular  shapes. 

•  The  algorithm  may  not  ensure  convergence  to  the  global  minimum.  It  can  be 
shown  that  K-means  will  always  converge  to  a  local  minimum  of  the  inertia 
(Eq.  (7.2)).  It  depends  on  the  random  initialization  of  the  seeds,  but  some  seeds 
can  result  in  a  poor  convergence  rate,  or  convergence  to  suboptimal  clustering. 
To  alleviate  the  problem  of  local  minima,  the  K-means  computation  is  often  per¬ 
formed  several  times,  with  different  centroid  initializations.  One  way  to  address 
this  issue  is  the  k-means +  +  initialization  scheme,  which  has  been  implemented 
in  Scikit-learn  (use  the  init=  '  kmeans  +  +  '  parameter).  This  parameter 
initializes  the  centroids  to  be  (generally)  far  from  each  other,  thereby  probably 
leading  to  better  results  than  random  initialization. 

•  This  algorithm  requires  the  number  of  clusters  to  be  specified.  Different  heuristics 
can  be  applied  to  predetermine  the  number  of  seeds  of  the  algorithm. 

•  It  scales  well  to  a  large  number  of  samples  and  has  been  used  across  a  large  range 
of  application  areas  in  many  different  fields. 

In  summary,  we  can  conclude  that  K-means  has  the  advantages  of  allowing  the 
easy  use  of  heuristics  to  select  good  seeds;  initialization  of  seeds  by  other  methods; 
multiple  points  to  be  tried.  However,  in  contrast,  it  still  cannot  ensure  that  the  local 
minima  problem  is  overcome;  it  is  iterative  and  hence  slow  when  there  are  a  lot  of 
high-dimensional  samples;  and  it  tends  to  look  for  spherical  clusters. 


7.23.2  Spectral  Clustering 

Up  to  this  point,  the  clustering  procedure  has  been  considered  as  a  way  to  find  data 
groups  following  a  notion  of  compactness.  Another  way  of  looking  at  what  a  cluster 
is  is  provided  by  connectivity  (or  similarity).  Spectral  clustering  [2]  refers  to  a  family 
of  methods  that  use  spectral  techniques.  Specifically,  these  techniques  are  related  to 
the  eigendecomposition  of  an  affinity  or  similarity  matrix  and  solve  the  problem  of 
clustering  according  to  the  connectivity  of  the  data.  Let  us  consider  an  ideal  similarity 
matrix  of  two  clear  sets. 

Let  us  denote  the  similarity  matrix,  S ,  as  the  matrix  S[j  =  sfa ,  xj)  which  gives  the 
similarity  between  observations  v;  and  xj.  Remember  that  we  can  model  similarity 
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using  the  Euclidean  distance,  d  xj)  =  \  \ Xi  —  xj  1 1 2 ,  by  means  of  a  Gaussian  Kernel 
as  follows: 

s(xi,Xj )  =  exp(—a\\xi  -x/||2), 

where  a  is  a  parameter.  We  expect  two  points  from  different  clusters  to  be  far  away 
from  each  other.  However,  if  there  is  a  sequence  of  points  within  the  cluster  that  forms 
a  “path”  between  them,  this  also  would  lead  to  big  distance  among  some  of  the  points 
from  the  same  cluster.  Hence,  we  define  an  affinity  matrix  A  based  on  the  similarity 
matrix  S ,  where  A  contains  positive  values  and  is  symmetric.  This  can  be  done,  for 
example,  by  applying  a  k-nearest  neighbor  that  builds  a  graph  connecting  just  the 
k  closest  data  points.  The  symmetry  comes  from  the  fact  that  A#  and  Ay;  give  the 
distance  between  the  same  points.  Considering  the  affinity  matrix,  the  clustering  can 
be  seen  as  a  graph  partition  problem,  where  connected  graph  components  correspond 
to  clusters.  The  graph  obtained  by  spectral  clustering  will  be  partitioned  so  that  graph 
edges  connecting  different  clusters  have  low  weights,  and  vice  versa.  Furthermore, 
we  define  a  degree  matrix  D ,  where  each  diagonal  value  is  the  degree  of  the  respective 
graph  node  and  all  other  elements  are  0.  Finally,  we  can  compute  the  unnormalized 
graph  Faplacian  (U  =  D  —  A)  and/or  a  normalized  version  of  the  Faplacian  (L),  as 
follows: 

•  Simple  Laplacian :  L  =  I  —  D~lA ,  which  corresponds  to  a  random  walk,  being 

D~l  the  transition  matrix.  Spectral  clustering  obtains  groups  of  nodes  such  that 

the  random  walk  corresponds  to  seldom  transitions  from  one  group  to  another. 

•  Normalized  Laplacian :  L  =  D~  2  UD~  2 . 

•  Generalized  Laplacian :  L  =  D  ~lu. 

If  we  assume  that  there  are  k  clusters,  the  next  step  is  to  find  the  k  small¬ 
est  eigenvectors,  without  considering  the  trivial  constant  eigenvector.  Each  row  of 
the  matrix  formed  by  the  k  smallest  eigenvectors  of  the  Faplacian  matrix  defines 
a  transformation  of  the  data  v;.  Thus,  in  this  transformed  space,  we  can  apply 
K- means  clustering  in  order  to  find  the  final  clusters.  If  we  do  not  know  in  advance 
the  number  of  clusters,  k ,  we  can  look  for  sudden  changes  in  the  sorted  eigenvalues 
of  the  matrix,  U,  and  keep  the  smallest  ones. 


7.2.3.3  Hierarchical  Clustering 

Another  well-known  clustering  technique  of  particular  interest  is  hierarchical  cluster¬ 
ing.  Hierarchical  clustering  is  comprised  of  a  general  family  of  clustering  algorithms 
that  construct  nested  clusters  by  successive  merging  or  splitting  of  data.  The  hier¬ 
archy  of  clusters  is  represented  as  a  tree.  The  tree  is  usually  called  a  dendrogram. 
The  root  of  the  dendrogram  is  the  single  cluster  that  contains  all  the  samples;  the 
leaves  are  the  clusters  containing  only  one  sample  each.  This  is  a  nice  tool,  since 
it  can  be  straightforwardly  interpreted:  it  “explains”  how  clusters  are  formed  and 
visualizes  clusters  at  different  scales.  The  tree  that  results  from  the  technique  shows 
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the  similarity  between  the  samples.  Partitioning  is  computed  by  selecting  a  cut  on 
the  tree  at  a  certain  level. 

In  general,  there  are  two  types  of  hierarchical  clustering: 

•  Top-down  divisive  clustering  applies  the  following  algorithm: 

-  Start  with  all  the  data  in  a  single  cluster. 

-  Consider  every  possible  way  to  divide  the  cluster  into  two. 

-  Choose  the  best  division. 

-  Recursively,  it  operates  on  both  sides  until  a  stopping  criterion  is  met.  That  can 
be  something  as  follows:  there  are  as  much  clusters  as  data;  the  predetermined 
number  of  clusters  has  been  reached;  the  maximum  distance  between  all  possible 
partition  divisions  is  smaller  than  a  predetermined  threshold;  etc. 

•  Bottom-up  agglomerative  clustering  applies  the  following  algorithm: 

-  Start  with  each  data  point  in  a  separate  cluster. 

-  Repeatedly  join  the  closest  pair  of  clusters. 

-  At  each  step,  a  stopping  criterion  is  checked:  there  is  only  one  cluster;  a  prede¬ 
termined  number  of  clusters  has  been  reached;  the  distance  between  the  closest 
clusters  is  greater  than  a  predetermined  threshold;  etc. 

This  process  of  merging  forms  a  binary  tree  or  hierarchy. 

When  merging  two  clusters,  a  question  naturally  arises:  How  to  measure  the 
similarity  of  two  clusters?  There  are  different  ways  to  define  this  with  different 
results  for  the  agglomerative  clustering.  The  linkage  criterion  determines  the  metric 
used  for  the  cluster  merging  strategy: 

•  Maximum  or  complete  linkage  minimizes  the  maximum  distance  between  observa¬ 
tions  of  pairs  of  clusters.  Based  on  the  similarity  of  the  two  least  similar  members 
of  the  clusters,  this  clustering  tends  to  give  tight  spherical  clusters  as  a  final  result. 

•  Average  linkage  averages  similarity  between  members,  i.e.,  minimizes  the  average 
of  the  distances  between  all  observations  of  pairs  of  clusters. 

•  Ward  linkage  minimizes  the  sum  of  squared  differences  within  all  clusters.  It  is 
thus  a  variance-minimizing  approach  and  in  this  sense  is  similar  to  the  K-means 
objective  function,  but  tackled  with  an  agglomerative  hierarchical  approach. 

Let  us  illustrate  how  the  different  linkages  work  with  an  example.  Let  us  generate 
three  clusters  as  follows: 
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In  [8]  : 


r 


MAXNl  =  500 

MAXN2  =  400 

MAXN3  =  300 

XI  =  np .  concatenate  (  [ 

2.25  *  np  .  random  .  randn  (  MAXNl  ,  2  )  , 

4  +  1.7*  np  .  random  .  randn  ( MAXN2  ,  2)  ]  ) 

XI  =  np .  concatenate  (  [ 

XI,  [8,  3]  +  1.9*  np . random .  randn ( MAXN3  ,  2 )  ]  ) 

yl  =  np .  concatenate  (  [ 

np  .  ones  (  ( MAXNl  ,  1 )  )  , 

2  *  np  .  ones  (  (  MAXN2  ,  1 )  )  ]  ) 

yl  =  np .  concatenate  (  [ 

yl  ,  3  *  np  .  ones  (  (  MAXN3  ,  1)  )]). ravel  () 

yl  =  np  .  int_  (yl  ) 
label s_y 1  =  'o'] 

colors  =  [ ' r ' ,  ' g ' ,  ' b ' ] 


Let  us  apply  agglomerative  clustering  using  the  different  linkages: 


The  results  of  the  agglomerative  clustering  using  the  different  linkages:  complete, 
average,  and  Ward  are  given  in  Fig.  7.3.  Note  that  agglomerative  clustering  exhibits 
“rich  get  richer”  behavior  that  can  sometimes  lead  to  uneven  cluster  sizes,  with 
average  linkage  being  the  worst  strategy  in  this  respect  and  Ward  linkage  giving  the 
most  regular  sizes.  Ward  linkage  is  an  attempt  to  form  clusters  that  are  as  compact 
as  possible,  since  it  considers  inter-  and  intra-distances  of  the  clusters.  Meanwhile, 
for  non-Euclidean  metrics,  average  linkage  is  a  good  alternative.  Average  linkage 
can  produce  very  unbalanced  clusters,  it  can  even  separate  a  single  data  point  into  a 
separate  cluster.  This  fact  would  be  useful  if  we  want  to  detect  outliers,  but  it  may 
be  undesirable  when  two  clusters  are  very  close  to  each  other,  since  it  would  tend  to 
merge  them. 

Agglomerative  clustering  can  scale  to  a  large  number  of  samples  when  it  is  used 
jointly  with  a  connectivity  matrix,  but  it  is  computationally  expensive  when  no  con- 
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nectivity  constraints  are  added  between  samples:  it  considers  all  the  possible  merges 
at  each  step. 


7.2.3.4  Adding  Connectivity  Constraints 

Sometimes,  we  are  interested  in  introducing  a  connectivity  constraint  into  the  clus¬ 
tering  process  so  that  merging  of  nonadjacent  points  is  avoided.  This  can  be  achieved 
by  constructing  a  connectivity  matrix  that  defines  which  are  the  neighboring  samples 
in  the  dataset.  For  instance,  in  the  example  in  Fig.  7.4,  we  want  to  avoid  the  forma¬ 
tion  of  clusters  of  samples  from  the  different  circles.  A  sample  code  to  compute 
agglomerative  clustering  with  connectivity  would  be  as  follows: 
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Fig.  7.3  Illustration  of  agglomerative  clustering  using  different  linkages:  Ward,  complete,  and 
average.  The  symbol  of  each  data  point  corresponds  to  the  original  class  generated  and  the  color 
corresponds  to  the  cluster  obtained 
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average  linkage 


Without  connectivity 

complete  linkage 


ward  linkage 


average  linkage 


With  connectivity 

complete  linkage 


ward  linkage 


Fig.  7.4  Illustration  of  agglomerative  clustering  without  ( top  row )  and  with  {bottom  row )  a  connec¬ 
tivity  graph  using  the  three  linkages  (from  left  to  right):  average,  complete,  and  Ward.  The  colors 
correspond  to  the  clusters  obtained 


In  [10] 


connectivity  =  kneighbor s_graph (X , 

3  0  ) 

\ 

model  =  AgglomerativeClustering (linkage  =  'average', 

connectivity  =  connectivity,  n_ 

.clusters  =  8) 

model . fit ( X ) 

V 

_ y 

A  connectivity  constraint  is  useful  to  impose  a  certain  local  structure,  but  it  also 
makes  the  algorithm  faster,  especially  when  the  number  of  the  samples  is  large.  A 
connectivity  constraint  is  imposed  via  a  connectivity  matrix :  a  sparse  matrix  that  only 
has  elements  at  the  intersection  of  a  row  and  a  column  with  indexes  of  the  dataset 
that  should  be  connected.  This  matrix  can  be  constructed  from  a  priori  information 
or  can  be  learned  from  the  data,  for  instance  using  kneighbors_graph  to  restrict 
merging  to  nearest  neighbors  or  using  image  .  grid_to_graph  to  limit  merging 
to  neighboring  pixels  in  an  image,  both  from  Scikit-learn.  This  phenomenon  can  be 
observed  in  Fig.  7.4,  where  in  the  first  row  we  see  the  results  of  the  agglomerative 
clustering  without  using  a  connectivity  graph.  The  clustering  can  join  data  from 
different  circles  (e.g.,  the  black  cluster).  At  the  bottom,  the  three  linkages  use  a 
connectivity  graph  and  thus  two  of  them  avoid  joining  data  points  that  belong  to 
different  circles  (except  the  Ward  linkage  that  attempts  to  form  compact  and  isotropic 
clusters). 
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KMeans  Spectral  Cist,  Average  AggL  Cist,  Ward  Aggl.  Cist, 
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Fig.  7.5  Comparison  of  the  different  clustering  techniques  (from  left  to  right):  K-means,  spectral 
clustering,  and  agglomerative  clustering  with  average  and  Ward  linkage  on  simple  compact  datasets. 
In  the,  first  row,  the  expected  number  of  clusters  is  k  =  2  and  in  the  second  row:  k  =  4 


7.23.5  Comparison  of  Different  Hard  Partition  Clustering  Algorithms 

Let  us  compare  the  behavior  of  the  different  clustering  algorithms  discussed  so  far. 
For  this  purpose,  we  generate  three  different  datasets’  configurations: 

(a)  4  spherical  groups  of  data; 

(b)  a  uniform  data  distribution;  and 

(c)  a  non-flat  configuration  of  data  composed  of  two  moon-like  groups  of  data. 

An  easy  way  to  generate  these  datasets  is  by  using  Scikit-learn  that  has 
predefined  functions  for  it:  datasets  .make_blobs  ( ) ,  datasets  .ma-  ke_ 
moons  ( ) ,  etc. 

We  apply  the  clustering  techniques  discussed  above,  namely  K-means,  agglom¬ 
erative  clustering  with  average  linkage,  agglomerative  clustering  with  Ward  linkage, 
and  spectral  clustering.  Let  us  test  the  behavior  of  the  different  algorithms  assuming 
k  =  2  and  k  =  4.  Connectivity  is  applied  in  the  algorithms  where  it  is  applicable. 

In  the  simple  case  of  separated  clusters  of  data  and  k  =  4,  most  of  the  clustering 
algorithms  perform  well,  as  expected  (see  Fig.  7.5).  The  only  algorithm  that  could 
not  discover  the  four  groups  of  samples  is  the  average  agglomerative  clustering. 
Since  it  allows  highly  unbalanced  clusters,  the  two  noisy  data  points  that  are  quite 
separated  from  the  closest  two  blobs  were  considered  as  a  different  cluster,  while  the 
two  central  blobs  were  merged  in  one  cluster.  In  case  of  k  =  2,  each  of  the  methods 
is  obligated  to  join  at  least  two  blobs  in  a  cluster. 

Regarding  the  uniform  distribution  of  data  (see  Fig.  7.6),  K-means,  Ward  linkage 
agglomerative  clustering  and  spectral  clustering  tend  to  yield  even  and  compact 
clusters;  while  the  average  linkage  agglomerative  clustering  attempts  to  join  close 
points  as  much  as  possible  following  the  “rich  get  richer”  rule.  This  results  in  a 
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KMeans  Spectral  Cist.  Average  AggL  Cist.  Ward  Aggl .  Cist. 


Fig.  7.6  Comparison  of  the  different  clustering  techniques  (from  left  to  right):  K-means,  spectral 
clustering,  and  agglomerative  clustering  with  average  and  Ward  linkage  on  uniformly  distributed 
data.  In  the  first  row,  the  number  of  clusters  assumed  is  k  =  2  and  in  the  second  row:  k  =  4 


KMeans  Spectral  Cist.  Average  Aggl.  Cist.  Ward  Aggl.  Cist. 


Fig.  7.7  Comparison  of  the  different  clustering  techniques  (from  left  to  right):  K-means,  spec¬ 
tral  clustering,  and  agglomerative  clustering  with  average  and  Ward  linkage  on  non-flat  geometry 
datasets.  In  the  first  row,  the  expected  number  of  clusters  is  k  =  2  and  in  the  second  row:  k  =  4 


second  cluster  of  a  small  set  of  data.  This  behavior  is  observed  in  both  cases:  k  =  2 
and  k  —  4. 

Regarding  datasets  with  more  complex  geometry,  like  in  the  moon  dataset  (see 
Fig.  7.7),  K-means  and  Ward  linkage  agglomerative  clustering  attempt  to  construct 
compact  clusters  and  thus  cannot  separate  the  moons.  Due  to  the  connectivity  con¬ 
straint,  the  spectral  clustering  and  the  average  linkage  agglomerative  clustering  sep¬ 
arated  both  moons  in  case  of  k  =  2,  while  in  case  of  k  =  4,  the  average  linkage 
agglomerative  clustering  clustered  most  of  datasets  correctly  separating  some  of  the 
noisy  data  points  as  two  separate  single  clusters.  In  the  case  of  spectral  clustering, 
looking  for  four  clusters,  the  method  splits  each  of  the  two  moon  datasets  into  two 
clusters. 
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Fig.  7.8  Expenditure  on  different  educational  indicators  for  the  first  five  countries  in  the  Eurostat 
dataset 


7.3  Case  Study 

In  order  to  illustrate  clustering  with  a  real  dataset,  we  will  now  analyze  the  indicators 
of  spending  on  education  among  the  European  Union  member  states,  provided  by 
the  Eurostat  data  bank.  The  data  are  organized  by  year  (TIME)  from  2002  until 
2011  and  country  (GEO):  (‘Albania’,  ‘Austria’,  ‘Belgium’,  ‘Bulgaria’,  etc.).  Twelve 
indicators  (INDIC_ED)  of  financing  of  education  with  their  corresponding  values 
(Value)  are  given:  (1)  Expenditure  on  educational  institutions  from  private  sources 
as  %  of  gross  domestic  product  (GDP),  for  all  levels  of  education  combined;  (2) 
Expenditure  on  educational  institutions  from  public  sources  as  %  of  GDP,  for  all 
levels  of  government  combined,  (3)  Expenditure  on  educational  institutions  from 
public  sources  as  %  of  total  public  expenditure,  for  all  levels  of  education  combined, 
(4)  Public  subsidies  to  the  private  sector  as  %  of  GDP,  for  all  levels  of  education 
combined,  (5)  Public  subsidies  to  the  private  sector  as  %  of  total  public  expenditure, 
for  all  levels  of  education  combined,  etc.  We  can  store  the  12  indicators  for  a  given 
year  (e.g.,  2010)  in  a  table.  Figure  7.8  provides  visualization  of  the  first  five  countries 
in  the  table. 

As  we  can  observe,  this  is  not  a  clean  dataset,  since  there  are  values  missing.  Some 
countries  have  very  limited  information  and  should  be  excluded.  Other  countries  may 
still  not  collect  or  have  access  to  a  few  indicators.  For  these  last  cases,  we  can  proceed 
in  two  ways:  (a)  fill  in  the  gaps  with  some  non-informative,  non-biasing  data;  or  (b) 
drop  the  features  with  missing  values  for  the  analysis.  If  we  have  many  features  and 
only  a  few  have  missing  values,  then  it  is  not  very  harmful  to  drop  them.  However,  if 
missing  values  are  spread  across  most  of  the  features,  we  eventually  have  to  deal  with 
them.  In  our  case,  both  options  seem  reasonable,  as  long  as  the  number  of  missing 
features  for  a  country  is  not  too  large.  We  will  proceed  in  both  ways  at  the  same  time. 

We  apply  both  options:  filling  the  gap  with  the  mean  value  of  the  feature  and 
the  dropping  option,  ignoring  the  indicators  with  missing  values.  Let  us  now  apply 
K-means  clustering  to  these  data  in  order  to  partition  the  countries  according  to 


3http://ec.europa.eu/eurostat. 
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Using  dropped  missing  values  data 
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Fig.  7.9  Clustering  of  the  countries  according  to  their  educational  expenditure  using  hlled-in  ( top 
row)  and  dropped  ( bottom  row )  missing  values 


their  investment  in  education  and  check  their  profiles.  Figure  7.9  shows  the  results 
of  this  K-means  clustering.  We  have  sorted  the  data  for  better  visualization.  At 
a  simple  glance,  we  can  see  that  the  partitions  (top  and  bottom  of  Fig.  7.9)  are 
different.  Most  countries  in  cluster  2  in  the  hlled-in  dataset  correspond  to  cluster  0 
in  the  dropped  missing  values  dataset.  Analogously,  most  of  cluster  0  in  the  hlled- 
in  dataset  correspond  to  cluster  1  in  the  dropped  missing  values  dataset;  and  most 
countries  from  cluster  1  in  the  hlled-in  dataset  correspond  to  cluster  2  in  the  dropped 
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0  1  2  3  4  5  6  7 

Economical  indicators 


Fig.  7.10  Mean  expenditure  of  the  different  clusters  according  to  the  8  indicators  of  the  indicators- 
dropped  dataset 


set.  Still,  there  are  some  countries  that  do  not  follow  this  rule.  That  is,  looking  at 
both  clusterings,  they  may  yield  similar  (up  to  label  permutation)  results,  but  they 
will  not  necessarily  always  coincide.  This  is  mainly  due  to  two  aspects:  the  random 
initialization  of  the  K-means  clustering  and  the  fact  that  each  method  works  in  a 
different  space  (i.e.,  dropped  data  in  8D  space  vs  filled-in  data,  working  in  12D 
space).  Note  that  we  should  not  consider  the  assigned  absolute  cluster  value,  since 
it  is  irrelevant.  The  mean  expenditure  of  the  different  clusters  is  shown  by  different 
colors  according  to  the  8  indicators  of  the  indicators-dropped  dataset  (see  Fig.  7.10). 

So,  without  loss  of  generality,  we  continue  analyzing  the  set  obtained  by  dropping 
missing  values.  Let  us  now  check  the  clusters  and  check  their  profile  by  looking  at 
the  centroids.  Visualizing  the  eight  values  of  the  three  clusters  (see  Fig.  7.10),  we  can 
see  that  cluster  1  spends  more  on  education  for  the  8  educational  indicators,  while 
cluster  0  is  the  one  with  least  resources  invested  in  education. 

Let  us  consider  a  specific  country,  e.g.,  Spain  and  its  expenditure  on  education. 
If  we  refine  cluster  0  further  and  check  how  close  members  are  from  this  cluster 
to  cluster  1,  it  may  give  us  a  hint  as  to  a  possible  ordering.  When  visualizing  the 
distance  to  cluster  0  and  1,  we  can  observe  that  Spain,  while  being  from  cluster  0,  has 
a  smaller  distance  to  cluster  1  (see  Fig.  7.11).  This  should  make  us  realize  that  using  3 
clusters  probably  does  not  sufficiently  represent  the  groups  of  countries.  So  we  redo 
the  process,  but  applying  k  =  4:  we  obtain  4  clusters.  This  time  cluster  0  includes 
the  EU  members  with  medium  expenditure  (Fig.  7.12).  This  reinforce  the  intuition 
about  Spain  being  a  limit  case  in  the  former  clustering.  The  clusters  obtained  are  as 
follows: 

•  Cluster  0:  (‘Austria’,  ‘Estonia’,  ‘EU13’,  ‘EU15’,  ‘EU25’,  ‘EU27’,  ‘France’, 

‘Germany’,  ‘Hungary’,  ‘Latvia’,  ‘Lithuania’,  ‘Netherlands’,  ‘Poland’,  ‘Portugal’, 

‘Slovenia’,  ‘Spain’,  ‘Switzerland’,  ‘United  Kingdom’,  ‘United  States’) 


7.3  Case  Study 
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Fig.  7.11  Distance  of  countries  in  cluster  0  to  centroids  of  cluster  0  (in  red )  and  cluster  1  (in  blue ) 
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Fig.  7.1 2  K-means  applied  to  the  Eurostat  dataset  grouping  the  countries  into  four  clusters 


•  Cluster  1:  (‘Bulgaria’,  ‘Croatia’,  ‘Czech  Republic’,  ‘Italy’,  ‘Japan’,  ‘Romania’, 
‘Slovakia’) 

•  Cluster  2:  (‘Cyprus’,  ‘Denmark’,  ‘Iceland’) 

•  Cluster  3:  (‘Belgium’,  ‘Finland’,  ‘Ireland’,  ‘Malta’,  ‘Norway’,  ‘Sweden’) 

We  can  repeat  the  process  using  the  alternative  clustering  techniques  and  compare 
their  results.  Let  us  first  apply  spectral  clustering.  The  corresponding  code  will  be 
as  follows: 
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Applying  Spectral  Clustering  on  the  drop  features 
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Fig.  7.1 3  Spectral  clustering  applied  to  the  European  countries  according  to  their  expenditure  on 
education 


In  [11] : 

f 

X  =  StandardScaler ( ) . f it_transform ( edudrop . values ) 
distances  =  euclidean_distances ( edudrop . values ) 
spectral  =  cluster . SpectralClustering ( 

n_clusters  =  4,  affinity  =  " neares t_neighbors " ) 

spectral . fit ( edudrop . values ) 

y_pred  =  spectral . labels_. astype (np. int) 

< _ 

_ y 

The  result  of  this  spectral  clustering  is  shown  in  Fig.  7.13.  Note  that  in  general, 
the  aim  of  spectral  clustering  is  to  obtain  more  balanced  clusters.  In  this  way,  the 
predicted  cluster  1  merges  clusters  2  and  3  of  the  K-means  clustering,  cluster  2 
corresponds  to  cluster  1  of  the  K-means  clustering,  cluster  0  mainly  shifts  to  cluster 
2,  and  cluster  3  corresponds  to  cluster  0  of  the  K-means. 

Applying  agglomerative  clustering,  not  only  we  do  obtain  different  clusters,  but 
also  we  can  see  how  different  clusters  are  obtained.  Thus,  in  some  way  it  is  giving 
us  information  on  which  the  most  similar  pairs  of  countries  and  clusters  are.  The 
corresponding  code  that  applies  the  agglomerative  clustering  will  be  as  follows: 
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In  [12] 


c - \ 

from  s c ipy . c lus t er . hi e r ar chy  import  linkage,  dendrogram 
from  s c ipy  .  spa t i a 1  . di s t anc e  import  pdist 

X_train  =  edudrop . values 

dist  =  pdist (X_train ,  'euclidean') 

1 inkage_mat r ix  =  linkage (dist ,  method  =  ' complete ' ) ; 

pit . figure ( f igsize  =  (11.3,  11.3)) 

dendrogram ( 1 i nkage_ma trix  ,  orientation=" right"  , 

c o 1 o r_t hr e sho 1 d  =  3, 
labels  =  wrk_count ries_names  , 
leaf_font_size  =  20); 

pit  .  tight_layout  (  ) 

V _ / 


In  Sc  ikit- learn,  the  parameter  color -threshold  of  the  command  dendro- 
gram()  colors  all  the  descendent  links  below  a  cluster  node  k  the  same  color  if  k  is 
the  first  node  below  the  color_threshold.  All  links  connecting  nodes  with  distances 
greater  than  or  equal  to  the  threshold  are  colored  blue.  Hence,  using  color -threshold 
=  3,  the  clusters  obtained  are  as  follows: 

•  Cluster  0:  (‘Cyprus’,  ‘Denmark’,  ‘Iceland’) 

•  Cluster  1:  (‘Bulgaria’,  ‘Croatia’,  ‘Czech  Republic’,  ‘Italy’,  ‘Japan’,  ‘Romania’, 
‘Slovakia’) 

•  Cluster  2:  (‘Belgium’,  ‘Finland’,  ‘Ireland’,  ‘Malta’,  ‘Norway’,  ‘Sweden’) 

•  Cluster  3:  (‘Austria’,  ‘Estonia’,  ‘EU13’,  ‘EU15’,  ‘EU25’,  ‘EU27’,  ‘France’, 
‘Germany’,  ‘Hungary’,  ‘Latvia’,  ‘Lithuania’,  ‘Netherlands’,  ‘Poland’,  ‘Portugal’, 
‘Slovenia’,  ‘Spain’,  ‘Switzerland’,  ‘United  Kingdom’,  ‘United  States’) 

Note  that,  to  a  high  degree,  they  correspond  to  the  clusters  obtained  by  the  K-means 
(except  for  permutation  of  cluster  labels,  which  is  irrelevant). 


Figure  7. 14  shows  the  construction  of  the  clusters  using  complete  linkage  agglom- 
erative  clustering.  Different  cuts  at  different  levels  of  the  dendrogram  allow  us  to 
obtain  different  numbers  of  clusters. 

To  summarize,  we  can  compare  the  results  of  the  three  clustering  approaches.  We 
cannot  expect  the  results  to  coincide,  since  the  different  approaches  are  based  on 
different  criteria  for  constructing  clusters.  Nonetheless,  we  can  still  observe  that  in 
this  case,  K-means  and  the  agglomerative  approaches  gave  the  same  results  (up  to  a 
permutation  of  the  number  of  cluster,  which  is  irrelevant);  while  spectral  clustering 
gave  more  evenly  distributed  clusters.  This  later  approach  fused  clusters  0  and  2  of 
the  agglomerative  clustering  in  cluster  1,  and  split  cluster  3  of  the  agglomerative 
clustering  into  its  clusters  0  and  3.  Note  that  these  results  could  change  when  using 
different  distances  among  data. 
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EU15 
EU13 
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Fig.  7.1 4  Agglomerative  clustering  applied  to  cluster  European  countries  according  to  their  expen¬ 
diture  on  education 


7.4  Conclusions 

In  this  chapter,  we  have  introduced  the  unsupervised  learning  problem  as  a  problem 
of  knowledge  or  structure  discovery  from  a  set  of  unlabeled  data.  We  have  focused 
on  clustering  as  one  of  the  main  problems  in  unsupervised  learning.  Basic  concepts 
such  as  distance,  similarity,  connectivity,  and  the  quality  of  the  clustering  results 
have  been  discussed  as  the  main  elements  to  be  determined  before  choosing  a  spe¬ 
cific  clustering  technique.  Three  basic  clustering  techniques  have  been  introduced: 
K-means,  agglomerative  clustering,  and  spectral  clustering.  We  have  discussed  their 
advantages  and  disadvantages  and  compared  them  through  different  examples.  One 
of  the  important  parameters  for  most  clustering  techniques  is  the  number  of  clusters 
expected. 

Regarding  scalability,  K-means  can  be  applied  to  very  large  datasets,  but  the 
number  of  clusters  should  be  as  much  as  medium  value,  due  to  its  iterative  procedure. 
Spectral  clustering  can  manage  datasets  that  are  not  very  large  and  a  reasonable 
number  of  clusters,  since  it  is  based  on  computing  the  eigenvectors  of  the  affinity 
matrix.  In  this  aspect,  the  best  option  is  hierarchical  clustering,  which  allows  large 
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numbers  of  samples  and  clusters  to  be  tackled.  Regarding  uses,  K-means  is  best 
suited  to  data  with  a  flat  geometry  (isotropic  and  compact  clusters),  while  spectral 
clustering  and  agglomerative  clustering,  with  either  average  or  complete  linkage, 
are  able  to  detect  patterns  in  data  with  non-flat  geometry.  The  connectivity  graph 
is  especially  helpful  in  such  cases.  At  the  end  of  the  chapter,  a  case  study  using  a 
Eurostat  database  has  been  considered  to  show  the  applicability  of  the  clustering  in 
real  problems  (with  real  datasets). 

Acknowledgements  This  chapter  was  co-written  by  Petia  Radeva  and  Oriol  Pujol. 
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Network  Analysis 


8.1  Introduction 

Network  data  are  generated  when  we  consider  relationships  between  two  or  more 
entities  in  the  data,  like  the  highways  connecting  cities,  friendships  between  peo¬ 
ple  or  their  phone  calls.  In  recent  years,  a  huge  number  of  network  data  are  being 
generated  and  analyzed  in  different  fields.  For  instance,  in  sociology  there  is  inter¬ 
est  in  analyzing  blog  networks,  which  can  be  built  based  on  their  citations,  to  look 
for  divisions  in  their  structures  between  political  orientations.  Another  example  is 
infectious  disease  transmission  networks,  which  are  built  in  epidemiological  studies 
to  find  the  best  way  to  prevent  infection  of  people  in  a  territory,  by  isolating  cer¬ 
tain  areas.  Other  examples  studied  in  the  field  of  technology  include  interconnected 
computer  networks  or  power  grids,  which  are  analyzed  to  optimize  their  functioning. 
We  also  find  examples  in  academia,  where  we  can  build  co-authorship  networks  and 
citation  networks  to  analyze  collaborations  among  Universities. 

Structuring  data  as  networks  can  facilitate  the  study  of  the  data  for  different  goals; 
for  example,  to  discover  the  weaknesses  of  a  structure.  That  could  be  the  objective 
of  a  biologist  studying  a  community  of  plants  and  trying  to  establish  which  of  its 
properties  promote  quick  transmission  of  a  disease.  A  contrasting  objective  would  be 
to  find  and  exploit  structures  that  work  efficiently  for  the  transmission  of  messages 
across  the  network.  This  may  be  the  goal  of  an  advertising  agent  trying  to  find  the 
best  strategy  for  spreading  publicity. 

How  to  analyze  networks  and  extract  the  features  we  want  to  study  are  some 
of  the  issues  we  consider  in  this  chapter.  In  particular,  we  introduce  some  basic 
concepts  related  with  networks,  such  as  connected  components,  centrality  measures, 
ego-networks,  and  PageRank.  We  present  some  useful  Python  tools  for  the  analysis 
of  networks  and  discuss  some  of  the  visualization  options.  In  order  to  motivate  and 
illustrate  the  concepts,  we  perform  social  network  analysis  using  real  data.  We  present 
a  practical  case  based  on  a  public  dataset  which  consists  of  a  set  of  interconnected 
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Facebook  friendship  networks.  We  formulate  multiple  questions  at  different  levels: 

the  local/member  level,  the  community  level,  and  the  global  level. 

In  general,  some  of  the  questions  we  try  to  solve  are  the  following: 

•  What  type  of  network  are  we  dealing  with? 

•  Which  is  the  most  representative  member  of  the  network  in  terms  of  being  the 
most  connected  to  the  rest  of  the  members? 

•  Which  is  the  most  representative  member  of  the  network  in  terms  of  being  the 
most  circulated  on  the  paths  between  the  rest  of  the  members? 

•  Which  is  the  most  representative  member  of  the  network  in  terms  of  proximity  to 
the  rest  of  the  members? 

•  Which  is  the  most  representative  member  of  the  network  in  terms  of  being  the 
most  accessible  from  any  location  in  the  network? 

•  There  are  many  ways  of  calculating  the  representativeness  or  importance  of  a 
member,  each  one  with  a  different  meaning,  so:  how  can  we  illustrate  them  and 
compare  them? 

•  Are  there  different  communities  in  the  network?  If  so,  how  many? 

•  Does  any  member  of  the  network  belong  to  more  than  one  community?  That  is, 
is  there  any  overlap  between  the  communities?  How  much  overlap?  How  can  we 
illustrate  this  overlap? 

•  Which  is  the  largest  community  in  the  network? 

•  Which  is  the  most  dense  community  (in  terms  of  connections)? 

•  How  can  we  automatically  detect  the  communities  in  the  network? 

•  Is  there  any  difference  between  automatically  detected  communities  and  real  ones 
(manually  labeled  by  users)? 


8.2  Basic  Definitions  in  Graphs 

Graph  is  the  mathematical  term  used  to  refer  to  a  network.  Thus,  the  field  that 
studies  networks  is  called  graph  theory  and  it  provides  the  tools  necessary  to  analyze 
networks.  Leonhard  Euler  defined  the  first  graph  in  1735,  as  an  abstraction  of  one  of 
the  problems  posed  by  mathematicians  of  the  time  regarding  Konigsberg,  a  city  with 
two  islands  created  by  the  River  Pregel,  which  was  crossed  by  seven  bridges.  The 
problem  was:  is  it  possible  to  walk  through  the  town  of  Konigsberg  crossing  each 
bridge  once  and  only  once?  Euler  represented  the  land  areas  as  nodes  and  the  bridges 
connecting  them  as  edges  of  a  graph  and  proved  that  the  walk  was  not  possible  for 
this  particular  graph. 

A  graph  is  defined  as  a  set  of  nodes ,  which  are  an  abstraction  of  any  entities 
(parts  of  a  city,  persons,  etc.),  and  the  connecting  links  between  pairs  of  nodes  called 
edges  or  relationships.  The  edge  between  two  nodes  can  be  directed  or  undirected.  A 
directed  edge  means  that  the  edge  points  from  one  node  to  the  other  and  not  the  other 
way  round.  An  example  of  a  directed  relationship  is  “a  person  knows  another  person”. 
An  edge  has  a  direction  when  person  A  knows  person  B,  and  not  the  reverse  direction 
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Fig.  8.1  Simple  undirected 
labeled  graph  with  5  nodes 
and  5  edges 


if  B  does  not  know  A  (which  is  usual  for  many  fans  and  celebrities).  An  undirected 
edge  means  that  there  is  a  symmetric  relationship.  An  example  is  “a  person  shook 
hands  with  another  person”;  in  this  case,  the  relationship,  unavoidably,  involves  both 
persons  and  there  is  no  directionality.  Depending  on  whether  the  edges  of  a  graph  are 
directed  or  undirected,  the  graph  is  called  a  directed  graph  or  an  undirected  graph , 
respectively. 

The  degree  of  a  node  is  the  number  of  edges  that  connect  to  it.  Figure  8.1  shows 
an  example  of  an  undirected  graph  with  5  nodes  and  5  edges.  The  degree  of  node  C 
is  1,  while  the  degree  of  nodes  A,  D  and  E  is  2  and  for  node  B  it  is  3.  If  a  network  is 
directed,  then  nodes  have  two  different  degrees,  the  in-degree ,  which  is  the  number 
of  incoming  edges,  and  the  out-degree ,  which  is  the  number  of  outgoing  edges. 

In  some  cases,  there  is  information  we  would  like  to  add  to  graphs  to  model 
properties  of  the  entities  that  the  nodes  represent  or  their  relationships.  We  could  add 
strengths  or  weights  to  the  links  between  the  nodes,  to  represent  some  real-world 
measure.  For  instance,  the  length  of  the  highways  connecting  the  cities  in  a  network. 
In  this  case,  the  graph  is  called  a  weighted  graph. 

Some  other  elementary  concepts  that  are  useful  in  graph  analysis  are  those  we 
explain  in  what  follows.  We  define  a  path  in  a  network  to  be  a  sequence  of  nodes 
connected  by  edges.  Moreover,  many  applications  of  graphs  require  shortest  paths 
to  be  computed.  The  shortest  path  problem  is  the  problem  of  finding  a  path  between 
two  nodes  in  a  graph  such  that  the  length  of  the  path  or  the  sum  of  the  weights  of 
edges  in  the  path  is  minimized.  In  the  example  in  Fig.  8.1,  the  paths  (C,  A,  B,  E)  and 
(C,  A,  B,  D,  E)  are  those  between  nodes  C  and  E.  This  graph  is  unweighted,  so  the 
shortest  path  between  C  and  E  is  the  one  that  follows  the  fewer  edges:  (C,  A,  B,  E). 

A  graph  is  said  to  be  connected  if  for  every  pair  of  nodes,  there  is  a  path  between 
them.  A  graph  is  fully  connected  or  complete  if  each  pair  of  nodes  is  connected  by 
an  edge.  A  connected  component  or  simply  a  component  of  a  graph  is  a  subset  of  its 
nodes  such  that  every  node  in  the  subset  has  a  path  to  every  other  one.  In  the  example 
of  Fig.  8.1,  the  graph  has  one  connected  component.  A  subgraph  is  a  subset  of  the 
nodes  of  a  graph  and  all  the  edges  linking  those  nodes.  Any  group  of  nodes  can  form 
a  subgraph. 
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8.3  Social  Network  Analysis 

Social  network  analysis  processes  social  data  structured  in  graphs.  It  involves  the 
extraction  of  several  characteristics  and  graphics  to  describe  the  main  properties  of 
the  network.  Some  general  properties  of  networks,  such  as  the  shape  of  the  network 
degree  distribution  (defined  bellow)  or  the  average  path  length,  determine  the  type 
of  network,  such  as  a  small-world  network  or  a  scale-free  network.  A  small-world 
network  is  a  type  of  graph  in  which  most  nodes  are  not  neighbors  of  one  another,  but 
most  nodes  can  be  reached  from  every  other  node  in  a  small  number  of  steps.  This 
is  the  so-called  small-world  phenomenon  which  can  be  interpreted  by  the  fact  that 
strangers  are  linked  by  a  short  chain  of  acquaintances.  In  a  small- world  network, 
people  usually  form  communities  or  small  groups  where  everyone  knows  every¬ 
one  else.  Such  communities  can  be  seen  as  complete  graphs.  In  addition,  most  the 
community  members  have  a  few  relationships  with  people  outside  that  community. 
However,  some  people  are  connected  to  a  large  number  of  communities.  These  may 
be  celebrities  and  such  people  are  considered  as  the  hubs  that  are  responsible  for 
the  small-world  phenomenon.  Many  small-world  networks  are  also  scale-free  net¬ 
works.  In  a  scale-free  network  the  node  degree  distribution  follows  a  power  law  (a 
relationship  function  between  two  quantities  v  and  y  defined  as  y  =  xn,  where  n  is 
a  constant).  The  name  scale -free  comes  from  the  fact  that  power  laws  have  the  same 
functional  form  at  all  scales,  i.e.,  their  shape  does  not  change  on  multiplication  by  a 
scale  factor.  Thus,  by  definition,  a  scale-free  network  has  many  nodes  with  a  very  few 
connections  and  a  small  number  of  nodes  with  many  connections.  This  structure  is 
typical  of  the  World  Wide  Web  and  other  social  networks.  In  the  following  sections, 
we  illustrate  this  and  other  graph  properties  that  are  useful  in  social  network  analysis. 


8.3.1  Basics  in  NetworkX 

NetworkX  is  a  Python  toolbox  for  the  creation,  manipulation  and  study  of  the  struc¬ 
ture,  dynamics  and  functions  of  complex  networks.  After  importing  the  toolbox,  we 
can  create  an  undirected  graph  with  5  nodes  by  adding  the  edges,  as  is  done  in  the 
following  code.  The  output  is  the  graph  in  Fig.  8.1. 


1  https://networkit.iti.kit.edu. 
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In  [2]  : 


Out  [  2  ]  : 


In  [3]  : 


In  [4]  : 


Out  [  4  ]  : 


8.3.2  Practical  Case:  Facebook  Dataset 

For  our  practical  case  we  consider  data  from  the  Facebook  network.  In  particular,  we 
use  the  data  Social  circles :  Facebook1  from  the  Stanford  Large  Network  Dataset 
(SNAP)  collection.  The  SNAP  collection  has  links  to  a  great  variety  of  networks 
such  as  Facebook- style  social  networks,  citation  networks,  Twitter  networks  or  open 
communities  like  Live  Journal.  The  Facebook  dataset  consists  of  a  network  repre¬ 
senting  friendship  between  Facebook  users.  The  Facebook  data  was  anonymized  by 
replacing  the  internal  Facebook  identifiers  for  each  user  with  a  new  value. 

The  network  corresponds  to  an  undirected  and  unweighted  graph  that  contains 
users  of  Facebook  (nodes)  and  their  friendship  relations  (edges).  The  Facebook 
dataset  is  defined  by  an  edge  list  in  a  plain  text  file  with  one  edge  per  line. 

Let  us  load  the  Facebook  network  and  start  extracting  the  basic  information  from 
the  graph,  including  the  numbers  of  nodes  and  edges,  and  the  average  degree: 

f  \ 

fb  =  nx . read_edgelist (" files / ch08 / f acebook_combined . txt " ) 

fb_n  ,  fb_k  =  fb  .  order  (  )  ,  fb  .  size  (  ) 

fb_avg_deg  =  fb_k  /  fb_n 

print  'Nodes :  ' ,  fb_n 

print  ' Edges :  ' ,  fb_k 

print  'Average  degree:  ' ,  fb_avg_deg 
\ _ / 


Nodes:  4039 
Edges:  88234 
Average  degree:  21 

The  Facebook  dataset  has  a  total  of 4,039  users  and  88,234  friendship  connections, 
with  an  average  degree  of  21.  In  order  to  better  understand  the  graph,  let  us  compute 
the  degree  distribution  of  the  graph.  If  the  graph  were  directed,  we  would  need  to 
generate  two  distributions:  one  for  the  in-degree  and  another  for  the  out-degree.  A 
way  to  illustrate  the  degree  distribution  is  by  computing  the  histogram  of  degrees 
and  plotting  it,  as  the  following  code  does  with  the  output  shown  in  Fig.  8.2: 

C  \ 

degrees  =  fb. degree (). values () 
degree_hist  =  pit . hist  ( degrees  ,  100) 

\ _ / 

The  graph  in  Fig.  8.2  is  a  power-law  distribution.  Thus,  we  can  say  that  the  Face- 
book  network  is  a  scale-free  network. 

Next,  let  us  find  out  if  the  Facebook  dataset  contains  more  than  one  connected 
component  (previously  defined  in  Sect.  8.2): 

C  \ 

print  '#  connected  components  of  Facebook  network:  ', 
nx . number_connec t ed_component s ( f b ) 

V _ / 


#  connected  components  of  Facebook  network:  1 


2https://snap. stanford.edu/data/egonets-Facebook.html. 
3  http :  //  snap .  Stanford,  edu/data/ . 
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In  [  5  ]  : 


Out  [  5  ]  : 


In  [  6  ]  : 


Out  [  6  ]  : 


As  it  can  be  seen,  there  is  only  one  connected  component  in  the  Facebook  network. 
Thus,  the  Facebook  network  is  a  connected  graph  (see  definition  in  Sect.  8.2).  We  can 
try  to  divide  the  graph  into  different  connected  components,  which  can  be  potential 
communities  (see  Sect.  8.6).  To  do  that,  we  can  remove  one  node  from  the  graph 
(this  operation  also  involves  removing  the  edges  linking  the  node)  and  see  if  the 
number  of  connected  components  of  the  graph  changes.  In  the  following  code,  we 
prune  the  graph  by  removing  node  ‘O’  (arbitrarily  selected)  and  compute  the  number 

of  connected  components  of  the  pruned  version  of  the  graph: 

/  \ 

fb_prun  =  nx . r ead_edge 1 i s t ( 

" files /ch08 / facebook_combined . txt " ) 
fb_prun . remove_node ( ' 0 ' ) 

print  ' Remaining  nodes : ' ,  fb_prun . number_o f _node s ( ) 
print  'New  #  connected  components: ' , 

nx . number_connec t ed_component s ( f b_prun ) 

\ _ / 


Remaining  nodes:  4038 

New  #  connected  components:  19 


Now  there  are  19  connected  components,  but  let  us  see  how  big  the  biggest  is  and 
how  small  the  smallest  is: 


f b_component s 

=  nx .  connect  ed_c  omp  onen  t  s  ( f b_pr un ) 

print  'Sizes 

of  the  connected  components ' , 

[ 1 en ( c ) 

v 

for  c  in  f b_c omponen t s ] 

J 

Sizes  of  the  connected  components  [4015,  1,  3,  2,  2,  1,  1,  1, 

1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1] 

This  simple  example  shows  that  removing  a  node  splits  the  graph  into  multiple 
components.  You  can  see  that  there  is  one  large  connected  component  and  the  rest 
are  almost  all  isolated  nodes.  The  isolated  nodes  in  the  pruned  graph  were  only 
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Fig.  8.2  Degree  histogram  distribution 
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connected  to  node  ‘O’  in  the  original  graph  and  when  that  node  was  removed  they 
were  converted  into  connected  components  of  size  1.  These  nodes,  only  connected 
to  one  neighbor,  are  probably  not  important  nodes  in  the  structure  of  the  graph.  We 
can  generalize  the  analysis  by  studying  the  centrality  of  the  nodes.  The  next  section 
is  devoted  to  explore  this  concept. 


8.4  Centrality 

The  centrality  of  a  node  measures  its  relative  importance  within  the  graph.  In  this 
section  we  focus  on  undirected  graphs.  Centrality  concepts  were  first  developed  in 
social  network  analysis.  The  first  studies  indicated  that  central  nodes  are  probably 
more  influential,  have  greater  access  to  information,  and  can  communicate  their 
opinions  to  others  more  efficiently  [1].  Thus,  the  applications  of  centrality  concepts 
in  a  social  network  include  identifying  the  most  influential  people,  the  most  informed 
people,  or  the  most  communicative  people.  In  practice,  what  centrality  means  will 
depend  on  the  application  and  the  meaning  of  the  entities  represented  as  nodes  in  the 
data  and  the  connections  between  those  nodes.  Various  measures  of  the  centrality 
of  a  node  have  been  proposed.  We  present  four  of  the  best-known  measures:  degree 
centrality ,  betweenness  centrality ,  closeness  centrality ,  and  eigenvector  centrality. 

Degree  centrality  is  defined  as  the  number  of  edges  of  the  node.  So  the  more  ties  a 
node  has,  the  more  central  the  node  is.  To  achieve  a  normalized  degree  centrality  of  a 
node,  the  measure  is  divided  by  the  total  number  of  graph  nodes  ( n )  without  counting 
this  particular  one  (n  —  1).  The  normalized  measure  provides  proportions  and  allows 
us  to  compare  it  among  graphs.  Degree  centrality  is  related  to  the  capacity  of  a  node 
to  capture  any  information  that  is  floating  through  the  network.  In  social  networks, 
connections  are  associated  with  positive  aspects  such  as  knowledge  or  friendship. 

Betweenness  centrality  quantifies  the  number  of  times  a  node  is  crossed  along 
the  shortest  path/s  between  any  other  pair  of  nodes.  For  the  normalized  measure 
this  number  is  divided  by  the  total  number  of  shortest  paths  for  every  pair  of  nodes. 
Intuitively,  if  we  think  of  a  public  bus  transportation  network,  the  bus  stop  (node) 
with  the  highest  betweenness  has  the  most  traffic.  In  social  networks,  a  person  with 
high  betweenness  has  more  power  in  the  sense  that  more  people  depend  on  him/her 
to  make  connections  with  other  people  or  to  access  information  from  other  people. 
Comparing  this  measure  with  degree  centrality,  we  can  say  that  degree  centrality 
depends  only  on  the  node’s  neighbors;  thus,  it  is  more  local  than  the  betweenness 
centrality,  which  depends  on  the  connection  properties  of  every  pair  of  nodes  in  the 
graph,  except  pairs  with  the  node  in  question  itself.  The  equivalent  measure  exists 
for  edges.  The  betweenness  centrality  of  an  edge  is  the  proportion  of  the  shortest 
paths  between  all  node  pairs  which  pass  through  it. 

Closeness  centrality  tries  to  quantify  the  position  a  node  occupies  in  the  network 
based  on  a  distance  calculation.  The  distance  metric  used  between  a  pair  of  nodes 
is  defined  by  the  length  of  its  shortest  path.  The  closeness  of  a  node  is  inversely 
proportional  to  the  length  of  the  average  shortest  path  between  that  node  and  all  the 
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other  nodes  in  the  graph.  In  this  case,  we  interpret  a  central  node  as  being  close  to, 
and  able  to  communicate  quickly  with,  the  other  nodes  in  a  social  network. 

Eigenvector  centrality  defines  a  relative  score  for  a  node  based  on  its  connections 
and  considering  that  connections  from  high  centrality  nodes  contribute  more  to  the 
score  of  the  node  than  connections  from  low  centrality  nodes.  It  is  a  measure  of  the 
influence  of  a  node  in  a  network,  in  the  following  sense:  it  measures  the  extent  to 
which  a  node  is  connected  to  influential  nodes.  Accordingly,  an  important  node  is 
connected  to  important  neighbors. 

Let  us  illustrate  the  centrality  measures  with  an  example.  In  Fig.  8.3,  we  show 
an  undirected  star  graph  with  n  =  8  nodes.  Node  C  is  obviously  important,  since 
it  can  exchange  information  with  more  nodes  than  the  others.  The  degree  centrality 
measures  this  idea.  In  this  star  network,  node  C  has  a  degree  centrality  of  7  or  1 
if  we  consider  the  normalized  measure,  whereas  all  other  nodes  have  a  degree  of  1 
or  1/7  if  we  consider  the  normalized  measure.  Another  reason  why  node  C  is  more 
important  than  the  others  in  this  star  network  is  that  it  lies  between  each  of  the  other 
pairs  of  nodes,  and  no  other  node  lies  between  C  and  any  other  node.  If  node  C 
wants  to  contact  F,  C  can  do  it  directly;  whereas  if  node  F  wants  to  contact  B,  it 
must  go  through  C.  This  gives  node  C  the  capacity  to  broke/prevent  contact  among 
other  nodes  and  to  isolate  nodes  from  information.  The  betweenness  centrality  is 
underneath  this  idea.  In  this  example,  the  betweenness  centrality  of  the  node  C  is  28, 
computed  as  (n  —  \){n  —  2) / 2,  while  the  rest  of  nodes  have  a  betweenness  of  0.  The 
final  reason  why  we  can  say  node  C  is  superior  in  the  star  network  is  because  C  is 
closer  to  more  nodes  than  any  other  node  is.  In  the  example,  node  C  is  at  a  distance 
of  1  from  all  other  7  nodes  and  each  other  node  is  at  a  distance  2  from  all  other  nodes, 
except  C.  So,  node  C  has  closeness  centrality  of  1/7,  while  the  rest  of  nodes  have  a 
closeness  of  1/13.  The  normalized  measures,  computed  by  dividing  by  n  —  1,  are  1 
for  C  and  7/13  for  the  other  nodes. 

An  important  concept  in  social  network  analysis  is  that  of  a  hub  node,  which  is 
defined  as  a  node  with  high  degree  centrality  and  betweenness  centrality.  When  a 
hub  governs  a  very  centralized  network,  the  network  can  be  easily  fragmented  by 
removing  that  hub. 

Coming  back  to  the  Facebook  example,  let  us  compute  the  degree  centrality  of 
Facebook  graph  nodes.  In  the  code  below  we  show  the  user  identifier  of  the  10  most 
central  nodes  together  with  their  normalized  degree  centrality  measure.  We  also 
show  the  degree  histogram  to  extract  some  more  information  from  the  shape  of  the 
distribution.  It  might  be  useful  to  represent  distributions  using  logarithmic  scale.  We 


Fig.  8.3  Star  graph  example 
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In  [7]  : 


Out  [  7  ]  : 


do  that  with  the  matplotlib .  loglog  ( )  function.  Figure  8.4  shows  the  degree 

centrality  histogram  in  linear  and  logarithmic  scales  as  computed  in  the  box  bellow. 

/ - \ 

degree_cent_f b  =  nx . degr ee_c en t r a 1 i t y ( f b ) 
print  ' Facebook  degree  centrality:  ' , 
sorted ( degree_cent_fb . items ( ) , 
key  =  lambda  x:  x[l] , 
reverse  =  True)  [  :  1 0 ] 

degr e  e_hi st  =  pit. hist  (list  (degree_cent_fb. values  ()),  100) 

pit  .  loglog  ( degr ee_hi s  t  [ 1 ]  [ 1 :  ]  , 

degr ee_hi st [0] ,  ' b ' ,  marker  =  'o') 

v _ y 


Facebook  degree  centrality:  [(u'107',  0.258791480931154), 

(u ' 1684 ' ,  0.1961367013372957)  ,  (u'1912' ,  0 . 18697374938088163 )  , 

(u 7  3437  7 ,  0.13546310054482416) ,  (u'0' ,  0.08593363051015354) , 

(u7 2543 7 ,  0.07280832095096582) ,  (u7 2347 7 ,  0 . 07206537890044576) , 

(u 7 1888  7 ,  0.062  9  02  42  6944  0317)  ,  (u7 1800 7 ,  0 . 060673 60079247152 )  , 

(u 7 1663  7 ,  0.05819712  72  9  07  379  84)] 

The  previous  plots  show  us  that  there  is  an  interesting  (large)  set  of  nodes  which 
corresponds  to  low  degrees.  The  representation  using  a  logarithmic  scale  (right-hand 
graphic  in  Fig.  8.4)  is  useful  to  distinguish  the  members  of  this  set  of  nodes,  which 
are  clearly  visible  as  a  straight  line  at  low  values  for  the  x-axis  (upper  left-hand 
part  of  the  logarithmic  plot).  We  can  conclude  that  most  of  the  nodes  in  the  graph 
have  low  degree  centrality;  only  a  few  of  them  have  high  degree  centrality.  These 
latter  nodes  can  be  properly  seen  as  the  points  in  the  bottom  right-hand  part  of  the 
logarithmic  plot. 

The  next  code  computes  the  betweenness,  closeness,  and  eigenvector  centrality 
and  prints  the  top  10  central  nodes  for  each  measure. 
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Fig.  8.4  Degree  centrality  histogram  shown  using  a  linear  scale  (left)  and  a  log  scale  for  both  the 
x-  and  y-axis  (right) 


150 


8  Network  Analysis 


In  [8]  : 


Out  [  8  ]  : 


Out  [  8  ]  : 


f  \ 

be tweenne s s_f b  =  nx . be twe enne s s_c en t r a 1 i t y ( f b ) 
c 1 o s ene s s_ f b  =  nx . c 1 o s ene s s_c en t r a 1 i t y ( f b ) 
e i gene en t r a 1 i t y_ f b  =  nx . e i genvec t o r_c en t r a 1 i ty ( f b ) 
print  ' Facebook  betweenness  centrality: ' , 
sorted (betweenness_fb. items () , 
key  =  lambda  x:  x[l] , 
reverse  =  True)  [:10] 
print  'Facebook  closeness  centrality: ' , 
sorted ( closeness_fb . items ( ) , 
key  =  lambda  x:  x[l] , 
reverse  =  True)  [  :  1  0 ] 

print  'Facebook  eigenvector  centrality: ' , 
sorted ( eigencentrali t y_f b .  i t  ems  ( )  , 
key  =  lambda  x:  x[l] , 
reverse  =  True)  [:10] 

V _ / 


Facebook  betweenness  centrality:  [(u'107',  0.4805180785560141), 
(u ' 1684 ' ,  0.33779744973019843) ,  (u'3437' ,  0.23611535735892616) , 

(u ' 1912 ' ,  0.2292953395868727) ,  (u'1085' ,  0.1490150921166526) , 

(u ' 0 ' ,  0.1463059214744276) ,  (u'698' ,  0.11533045020560861) , 

(u ' 567  ' ,  0.09631033121856114) ,  (u'58' ,  0.08436020590796521) , 

(u ' 42  8 ' ,  0.0  643  09  062  3  9323  9  08) ] 


Facebook  closeness  centrality:  [(u'107',  0.45969945355191255), 
(u'58' ,  0.3974018305284913) ,  (u'428' ,  0.3948371956585509) , 

(u ' 563 ' ,  0.3  93  912  78899  619  55)  ,  (u'1684' ,  0 . 393  60561458231796 )  , 

(u ' 171 ' ,  0.37049270575282134),  (u'348',  0.36991572004397216), 

(u ' 483 ' ,  0.3  698479575013739)  ,  (u'414' ,  0 . 369  543  33  302827  86 )  , 

(u ' 376 ' ,  0.36655773420479304) ] 

Facebook  eigenvector  centrality:  [(u'1912',  0.09540688873596524), 
(u ' 22  6  6 ' ,  0.0869832  822  6321951)  ,  (u'22  06' ,  0.08605240174265624)  , 

(u ' 2233 ' ,  0.08517341350597836) ,  (u'2464' ,  0.08427878364685948) , 

(u ' 2 142 ' ,  0.08419312450068105) ,  (u'2218' ,  0.08415574433673866) , 

(u ' 2  07  8 ' ,  0.08413  617  905810111),  (u'2123' ,  0.083  6714212  5897363), 

(u ' 1993 ' ,  0.08353243711860482) ] 

As  can  be  seen  in  the  previous  results,  each  measure  gives  a  different  ordering  of 
the  nodes.  The  node  ‘107’  is  the  most  central  node  for  degree  (see  box  Out  [  7  ] ), 
betweenness,  and  closeness  centrality,  while  it  is  not  among  the  10  most  central  nodes 
for  eigenvector  centrality.  The  second  most  central  node  is  different  for  closeness 
and  eigenvector  centralities;  while  the  third  most  central  node  is  different  for  all  four 
centrality  measures. 

Another  interesting  measure  is  the  current  flow  betweenness  centrality ,  also  called 
random  walk  betweenness  centrality ,  of  a  node.  It  can  be  defined  as  the  probability 
of  passing  through  the  node  in  question  on  a  random  walk  starting  and  ending  at 
some  node.  In  this  way,  the  betweenness  is  not  computed  as  a  function  of  shortest 
paths,  but  of  all  paths.  This  makes  sense  for  some  social  networks  where  messages 
may  get  to  their  final  destination  not  by  the  shortest  path,  but  by  a  random  path,  as 
in  the  case  of  gossip  floating  through  a  social  network  for  example. 

Computing  the  current  flow  betweenness  centrality  can  take  a  while,  so  we  will 
work  with  a  trimmed  Facebook  network  instead  of  the  original  one.  In  fact,  we  can 
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In  [9]  : 


Out  [  9  ]  : 


In  [10] : 


pose  the  question:  What  happen  if  we  only  consider  the  graph  nodes  with  more  than 
the  average  degree  of  the  network  (21)?  We  can  trim  the  graph  using  degree  centrality 
values.  To  do  this,  in  the  next  code,  we  define  a  function  to  trim  the  graph  based  on 
the  degree  centrality  of  the  graph  nodes.  We  set  the  threshold  to  21  connections: 


Degree  centrality  threshold:  0.00520059435364 
Remaining  #  nodes:  2226 

The  new  graph  is  much  smaller;  we  have  removed  almost  half  of  the  nodes  (we 
have  moved  from  4,039  to  2,226  nodes). 

The  current  flow  betweenness  centrality  measure  needs  connected  graphs,  as  does 
any  betweenness  centrality  measure,  so  we  should  first  extract  a  connected  compo¬ 
nent  from  the  trimmed  Facebook  network  and  then  compute  the  measure: 

/  \ 

fb_subgraph  =  list (nx. connected_component_subgraphs ( 
f b_t r ime  d )  ) 

print  '  #  subgraphs  found:',  size ( fb_subgraph ) 
print  '#  nodes  in  the  first  subgraph:  '  , 
len ( fb_subgraph [0] ) 

betweenness  =  nx . be t we enne s s_c en t r a 1 i t y ( f b_subgr aph [ 0 ] ) 
print  'Trimmed  FB  betweenness:  ', 

sorted (betweenness  .  items  ( )  ,  key  =  1 ambda  x :  x  [  1 ]  , 

reverse  =  True)  [  :  1  0 ] 

current_f low  =  nx . current_f low_betweenness_cent ral i ty ( 
fb_subgraph [0] ) 

print  'Trimmed  FB  current  flow  betweenness : ' , 

sorted (current_flow. items (),  key  =  1 ambda  x :  x [ 1 ] , 

reverse  =  True)  [  :  1 0 ] 

v _ y 
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Fig.  8.5  The  Facebook 
network  with  a  random 
layout 


Out [10] :  #  subgraphs  found:  2 

#  nodes  in  the  first  subgraph:  2225 

Trimmed  FB  betweenness:  [(u'107',  0.5469164906683255), 

(u ' 1684 ' ,  0.313  3  9  66633778371)  ,  (u'1912' ,  0 . 199  655974572  46995 )  , 

(u'3437' ,  0.13002843874261014) ,  (u'1577' ,  0.1274607407928195), 

(u ' 1085 ' ,  0.11517250980098293) ,  (u'1718' ,  0.08916631761105698) , 

(u ' 42  8 ' ,  0.0  63  827182  79123  78)  ,  (u'14  65' ,  0 . 057  99  59  007  477  317  55 )  , 

(u ' 567  ' ,  0.05414376521577943) ] 

Trimmed  FB  current  flow  betweenness:  [(u'107', 

0.2858892136334576) ,  (u'1718' ,  0.2678396761785764) ,  (u'1684' , 

0.1585162194931393) ,  (u'1085' ,  0.1572155780323929) ,  (u'1405' , 

0.1253563113363113) ,  (u'3437' ,  0.10482568101478178) ,  (u'1912' , 

0.09369897700970155) ,  (u'1577' ,  0.08897207040045449) ,  (u'136' , 

0 . 07052866082249776)  ,  (u'15  05' ,  0.06152  34704  6861114) ] 

As  can  be  seen,  there  are  similarities  in  the  10  most  central  nodes  for  the  between¬ 
ness  and  current  flow  betweenness  centralities.  In  particular,  seven  up  to  ten  are  the 
same  nodes,  even  if  they  are  differently  ordered. 


8.4.1  Drawing  Centrality  in  Graphs 

In  this  section  we  focus  on  graph  visualization,  which  can  help  in  the  network  data 
understanding  and  usability. 

The  visualization  of  a  network  with  a  large  amount  of  nodes  is  a  complex  task. 
Different  layouts  can  be  used  to  try  to  build  a  proper  visualization.  For  instance,  we 
can  draw  the  Facebook  graph  using  the  random  layout  (nx.  random_layout), 
but  this  is  a  bad  option,  as  can  be  seen  in  Fig.  8.5.  Other  alternatives  can  be  more 
useful.  In  the  box  below,  we  use  the  Spring  layout,  as  it  is  used  in  the  default  function 
(nx .  draw),  but  with  more  iterations.  The  function  nx .  spring_layout  returns 
the  position  of  the  nodes  using  the  Fruchterman-Reingold  force-directed  algorithm. 
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Fig.  8.6  The  Facebook 
network  drawn  using  the 
Spring  layout  and  degree 
centrality  to  define  the  node 
size 


This  algorithm  distributes  the  graph  nodes  in  such  a  way  that  all  the  edges  are  more 
or  less  equally  long  and  they  cross  themselves  as  few  times  as  possible.  Moreover, 
we  can  change  the  size  of  the  nodes  to  that  defined  by  their  degree  centrality.  As 
can  be  seen  in  the  code,  the  degree  centrality  is  normalized  to  values  between  0  and 
1 ,  and  multiplied  by  a  constant  to  make  the  sizes  appropriate  for  the  format  of  the 
figure: 


The  resulting  graph  visualization  is  shown  in  Fig.  8.6.  This  illustration  allows  us 
to  understand  the  network  better.  Now  we  can  distinguish  several  groups  of  nodes  or 
“communities”  clearly  in  the  graph.  Moreover,  the  larger  nodes  are  the  more  central 
nodes,  which  are  highly  connected  of  the  Facebook  graph. 

We  can  also  use  the  betweenness  centrality  to  define  the  size  of  the  nodes.  In  this 
way,  we  obtain  a  new  illustration  stressing  the  nodes  with  higher  betweenness,  which 
are  those  with  a  large  influence  on  the  transfer  of  information  through  the  network. 
The  new  graph  is  shown  in  Fig.  8.7.  As  expected,  the  central  nodes  are  now  those 
connecting  the  different  communities. 

Generally  different  centrality  metrics  will  be  positively  correlated,  but  when  they 
are  not,  there  is  probably  something  interesting  about  the  network  nodes.  For  instance, 
if  you  can  spot  nodes  with  high  betweenness  but  relatively  low  degree,  these  are  the 
nodes  with  few  links  but  which  are  crucial  for  network  flow.  We  can  also  look  for 
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Fig.  8.7  The  Facebook 
network  drawn  using  the 
Spring  layout  and 
betweenness  centrality  to 
define  the  node  size 


the  opposite  effect:  nodes  with  high  degree  but  relatively  low  betweenness.  These 
nodes  are  those  with  redundant  communication. 

Changing  the  centrality  measure  to  closeness  and  eigenvector,  we  obtain  the 
graphs  in  Figs.  8.8  and  8.9,  respectively.  As  can  be  seen,  the  central  nodes  are 
also  different  for  these  measures.  With  this  or  other  visualizations  you  will  be  able 
to  discern  different  types  of  nodes.  You  can  probably  see  nodes  with  high  closeness 
centrality  but  low  degree;  these  are  essential  nodes  linked  to  a  few  important  or  active 
nodes.  If  the  opposite  occurs,  if  there  are  nodes  with  high  degree  centrality  but  low 
closeness,  these  can  be  interpreted  as  nodes  embedded  in  a  community  that  is  far 
removed  from  the  rest  of  the  network. 

In  other  examples  of  social  networks,  you  could  find  nodes  with  high  closeness 
centrality  but  low  betweenness;  these  are  nodes  near  many  people,  but  since  there 
may  be  multiple  paths  in  the  network,  they  are  not  the  only  ones  to  be  near  many 
people.  Finally,  it  is  usually  difficult  to  find  nodes  with  high  betweenness  but  low 
closeness,  since  this  would  mean  that  the  node  in  question  monopolized  the  links 
from  a  small  number  of  people  to  many  others. 


8.4.2  PageRank 

PageRank  is  an  algorithm  related  to  the  concept  of  eigenvector  centrality  in  directed 
graphs.  It  is  used  to  rate  webpages  objectively  and  effectively  measure  the  attention 
devoted  to  them.  PageRank  was  invented  by  Larry  Page  and  Sergey  Brin,  and  became 
a  Google  trademark  in  1998  [2]. 

Assigning  the  importance  of  a  webpage  is  a  subjective  task,  which  depends  on  the 
interests  and  knowledge  of  the  persons  that  browse  the  webpages.  However,  there 
are  ways  to  objectively  rank  the  relative  importance  of  webpages. 
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Fig.  8.8  The  Facebook 
network  drawn  using  the 
Spring  layout  and  closeness 
centrality  to  define  the  node 
size 


Fig.  8.9  The  Facebook 
network  drawn  using  the 
Spring  layout  and 
eigenvector  centrality  to 
define  the  node  size 


We  consider  the  directed  graph  formed  by  nodes  corresponding  to  the  webpages 
and  edges  corresponding  to  the  hyperlinks.  Intuitively,  a  hyperlink  to  a  page  counts 
as  a  vote  of  support  and  a  page  has  a  high  rank  if  the  sum  of  the  ranks  of  its  incoming 
edges  is  high.  This  considers  both  cases  when  a  page  has  many  incoming  links  and 
when  a  page  has  a  few  highly  ranked  incoming  links.  Nowadays,  a  variant  of  the 
algorithm  is  used  by  Google.  It  does  not  only  use  information  on  the  number  of  edges 
pointing  into  and  out  of  a  website,  but  uses  many  more  variables. 

We  can  describe  the  PageRank  algorithm  from  a  probabilistic  point  of  view.  The 
rank  of  page  Pi  is  the  probability  that  a  surfer  on  the  Internet  who  starts  visiting  a 
random  page  and  follows  links,  visits  the  page  Pi .  With  more  details,  we  consider 
that  the  weights  assigned  to  the  edges  of  a  network  by  its  transition  matrix,  M,  are  the 
probabilities  that  the  surfer  goes  from  one  webpage  to  another.  We  can  understand  the 
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Fig.  8.1 0  The  Facebook 
network  drawn  using  the 
Spring  layout  and  PageRank 
to  define  the  node  size 


rank  computation  as  a  random  walk  through  the  network.  We  start  with  an  initial  equal 
probability  for  each  page:  uo  =  (^,...,^),  where  n  is  the  number  of  nodes.  Then 
we  can  compute  the  probability  that  each  page  is  visited  after  one  step  by  applying 
the  transition  matrix:  v\  =  Mv.  The  probability  that  each  page  will  be  visited  after 
k  steps  is  given  by  Vk  =  Mka.  After  several  steps,  the  sequence  converges  to  a 
unique  probabilistic  vector  a*  which  is  the  PageRank  vector.  The  i- th  element  of 
this  vector  is  the  probability  that  at  each  moment  the  surfer  visits  page  Pi .  We  need  a 
nonambiguous  definition  of  the  rank  of  a  page  for  any  directed  web  graph.  However, 
in  the  Internet,  we  can  expect  to  find  pages  that  do  not  contain  outgoing  links  and 
this  configuration  can  lead  to  certain  problems  to  the  explained  procedure.  In  order 
to  overcome  this  problem,  the  algorithm  fixes  a  positive  constant  p  between  0  and 
1  (a  typical  value  for  p  is  0.85)  and  redefines  the  transition  matrix  of  the  graph  by 
R  =  (1  —  p)  M  p  B,  where  B  =  ^7,  and  I  is  the  identity  matrix.  Therefore,  a 
node  with  no  outgoing  edges  has  probability  ^  of  moving  to  any  other  node. 

Let  us  compute  the  PageRank  vector  of  the  Facebook  network  and  use  it  to  define 
the  size  of  the  nodes,  as  was  done  in  box  In  [11]. 


The  code  above  outputs  the  graph  in  Fig.  8.10,  that  emphasizes  some  of  the  nodes 
with  high  PageRank.  Looking  the  graph  carefully  one  can  realize  that  there  is  one 
large  node  per  community. 
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8.5  Ego-Networks 

Ego-networks  are  subnetworks  of  neighbors  that  are  centered  on  a  certain  node.  In 
Facebook  and  Linkedln,  these  are  described  as  “your  network".  Every  person  in  an 
ego-network  has  her/his  own  ego-network  and  can  only  access  the  nodes  in  it.  All 
ego-networks  interlock  to  form  the  whole  social  network.  The  ego-network  definition 
depends  on  the  network  distance  considered.  In  the  basic  case,  a  distance  of  1,  a  link 
means  that  person  A  is  a  friends  of  person  B,  a  distance  of  2  means  that  a  person,  C,  is 
a  friend  of  a  friend  of  A,  and  a  distance  of  3  means  that  another  person,  D,  is  a  friend 
of  a  friend  of  a  friend  of  A.  Knowing  the  size  of  an  ego-network  is  important  when 
it  comes  to  understanding  the  reach  of  the  information  that  a  person  can  transmit  or 
have  access  to.  Figure  8.11  shows  an  example  of  an  ego-network.  The  blue  node  is 
the  ego ,  while  the  rest  of  the  nodes  are  red. 

Our  Facebook  network  was  manually  labeled  by  users  into  a  set  of  10  ego- 
networks.  The  public  dataset  includes  the  information  of  these  10  manually  defined 
ego-networks.  In  particular,  we  have  available  the  list  of  the  10  ego  nodes:  ‘O’,  ‘107’, 


ego-networks  are  interconnected  to  form  the  fully  connected  graph  we  have  been 
analyzing  in  previous  sections. 


In  Sect.  8.4  we  saw  that  node  ‘107’  is  the  most  central  node  of  the  Facebook 


network  for  three  of  the  four  centrality  measures  computed.  So,  let  us  extract  the 
ego-networks  of  the  popular  node  4 107’  with  a  distance  of  1  and  2,  and  compute  their 
sizes.  NetworkX  has  a  function  devoted  to  this  task: 


In  [13] : 


ego_107  =  nx  .  ego_graph ( fb  ,  '107') 

print  '#  nodes  of  ego  graph  107: 

1 en ( ego_l 0  7 ) 

print  '#  nodes  of  ego  graph  107  with  radius  up  to  2:  '  , 
1 en ( nx . ego_graph ( fb ,  '107',  radius  =  2)) 


Fig.  8.1 1  Example  of  an 
ego-network.  The  blue  node 
is  the  ego 
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Out [13] :  #  nodes  of  ego  graph  107:  1046 

#  nodes  of  ego  graph  107  with  radius  up  to  2:  2687 

The  ego-network  size  is  1,046  with  a  distance  of  1,  but  when  we  expand  the 
distance  to  2,  node  ‘107’  is  able  to  reach  up  to  2,687  nodes.  That  is  quite  a  large 
ego-network,  containing  more  than  half  of  the  total  number  of  nodes. 

Since  the  dataset  also  provides  the  previously  labeled  ego-networks,  we  can  com¬ 
pute  the  actual  size  of  the  ego-network  following  the  user  labeling.  We  can  access 
the  ego-networks  by  simply  importing  os  .path  and  reading  the  edge  list  corre¬ 
sponding,  for  instance,  to  node  ‘107’,  as  in  the  following  code: 


Out [ 14 ]: Nodes  of  the  ego  graph  107:  1034 

As  can  be  seen,  the  size  of  the  previously  defined  ego-network  of  node  ‘107’  is 
slightly  different  from  the  ego-network  automatically  computed  using  NetworkX. 
This  is  due  to  the  fact  that  the  manual  labeling  is  not  necessarily  referred  to  the 
subgraph  of  neighbors  at  a  distance  of  1 . 

We  can  now  answer  some  other  questions  about  the  structure  of  the  Facebook 
network  and  compare  the  10  different  ego-networks  among  them.  First,  we  can 
compute  which  the  most  densely  connected  ego-network  is  from  the  total  of  10.  To 
do  that,  in  the  code  below,  we  compute  the  number  of  edges  in  every  ego-network 
and  select  the  network  with  the  maximum  number: 
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Out [15] : The  most  densely  connected  ego-network  is  that  of  node:  1912 
Nodes:  747 
Edges:  30025 
Average  degree:  40 

The  most  densely  connected  ego-network  is  that  of  node  ‘1912’,  which  has  an 
average  degree  of  40.  We  can  also  compute  which  is  the  largest  (in  number  of  nodes) 
ego-network,  changing  the  measure  of  sizes  from  G .  size  ( )  by  G .  order  ( ) .  In 
this  case,  we  obtain  that  the  largest  ego-network  is  that  of  node  4 107’,  which  has 
1,034  nodes  and  an  average  degree  of  25. 

Next  let  us  work  out  how  much  intersection  exists  between  the  ego-networks  in 
the  Facebook  network.  To  do  this,  in  the  code  below,  we  add  a  field  ‘egonet’  for  every 
node  and  store  an  array  with  the  ego-networks  the  node  belongs  to.  Then,  having  the 
length  of  these  arrays,  we  compute  the  number  of  nodes  that  belong  to  1,  2,  3,  4  and 
more  than  4  ego-networks: 
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In  [16] : 


#  Add  a  field  ' egonet '  to  the  nodes  of  the  whole  facebook 

network  . 

#  Default  value  egonet  =  [] ,  meaning  that  this  node  does  not 

belong  to  any  ego-netowrk 
for  i  in  fb . nodes  ()  : 

fb . node  [ str ( i )  ]  [  ' egonet  '  ]  =  [] 

#  Fill  the  'egonet'  field  with  one  of  the  10  ego  values  in 

ego_ids : 

for  id  in  ego_ids  : 

G  =  nx . r ead_edge 1 i s t ( 

os .path. join ( ' files / ch08 / facebook ' , 

'  {0}  .  edges  '  .  format  (id)  )  , 
node type  =  int ) 
print  id 

for  n  in  G. nodes  (  )  : 

if  (fb. node  [str  (n)]  ['egonet']  ==  [])  : 

fb . node [ str (n) ] [ 'egonet']  =  [id] 

else  : 

fb.node [str (n) ] [ ' egonet ' ] . append (id) 

#  Compute  the  intersections: 


S  =  [ 1 en ( x [' egonet  ' 

]  ) 

for  x  in  f b  . 

node . values ( ) ] 

print 

'  # 

nodes 

into 

0 

ego  -  network : 

' ,  sum ( equal ( S , 

0) 

) 

print 

'  # 

nodes 

into 

1 

ego  -  network : 

' ,  sum ( equal ( S , 

1) 

) 

print 

'  # 

nodes 

into 

2 

ego  -  network : 

' ,  sum ( equal ( S , 

2) 

) 

print 

'  # 

nodes 

into 

3 

ego  -  network : 

' ,  sum ( equal ( S , 

3) 

) 

print 

'  # 

nodes 

into 

4 

ego  -  network : 

' ,  sum ( equal ( S , 

4) 

) 

print 

'  # 

nodes 

into 

more  than  4  ego-network:  ' ,\ 

sum ( great er ( S ,  4 


Out  [  16 ]  :  # 

nodes 

into 

0 

ego-network : 

80 

# 

nodes 

into 

1 

ego-network : 

3844 

# 

nodes 

into 

2 

ego-network : 

102 

# 

nodes 

into 

3 

ego-network : 

11 

# 

nodes 

into 

4 

ego-network : 

2 

# 

nodes 

into 

more  than  4  ego 

^-network:  0 

As  can  be  seen,  there  is  an  intersection  between  the  ego-networks  in  the  Facebook 
network,  since  some  of  the  nodes  belong  to  more  than  1  and  up  to  4  ego-networks 
simultaneously. 

We  can  also  try  to  visualize  the  different  ego-networks.  In  the  following  code, 
we  draw  the  ego-networks  using  different  colors  on  the  whole  Facebook  network 
and  we  obtain  the  graph  in  Fig.  8.12.  As  can  be  seen,  the  ego-networks  clearly  form 
groups  of  nodes  that  can  be  seen  as  communities. 
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Fig.  8.1 2  The  Facebook 
network  drawn  using  the 
Spring  layout  and  different 
colors  to  separate  the 
ego-networks 


In 


[17]  : 


C  \ 

#  Add  a  field  ' egocolor '  to  the  nodes  of  the  whole  facebook 
network . 

#  Default  value  egocolor  r  =0,  meaning  that  this  node 
does  not  belong  to  any  ego-netowrk  for  i  in  fb. nodes ()  : 

fb.node  [str(i)  ]  [  'egocolor'  ]  =  0 

#  Fill  the  'egocolor'  field  with  a  different  color  number 

for  each  ego-network  in  ego_ids: 
i dC  o 1 o  r  =  1 

for  id  in  ego_ids  : 

G  =  nx . r ead_edge 1 i s t ( 

os . path . join  (  '  files / ch08 / facebook  '  , 

'  {0}  .  edges  '  .  format  (id)  )  , 
node type  =  int ) 
for  n  in  G. nodes  (  )  : 

fb.node [str (n) ] [ 'egocolor']  =  idColor 
i dC  o 1 o  r  +  =  1 

colors  =  [  x  [  'egocolor']  for  x  in  fb . node . values  ( )  ] 

ns i ze  =  np . array ( [v  for  v  in  degree_cent_fb . values () ] ) 

nsize  =  500*  (  nsize  -  min  (nsize  )  )  /  (max  (nsize  )  -  min  (nsize  )  ) 

nodes  =  nx . dr aw_ne two r kx_node s ( 
fb ,  pos  =  pos_fb , 

cmap  =  pit . get_cmap ( ' Paired ' ) , 
node_color  =  colors, 
node_size  =  nsize, 
with_labels  =  False) 

edges =nx . draw_networkx_edges ( fb ,  pos  =  pos_fb ,  alpha  =  .1) 

\ _ / 


However,  the  graph  in  Fig.  8.12  does  not  illustrate  how  much  overlap  is  there 
between  the  ego-networks.  To  do  that,  we  can  visualize  the  intersection  between 
ego-networks  using  a  Venn  or  an  Euler  diagram.  Both  diagrams  are  useful  in  order  to 
see  how  networks  are  related.  Figure  8.13  shows  the  Venn  diagram  of  the  Facebook 
network.  This  powerful  and  complex  graph  cannot  be  easily  built  in  Python  tool- 
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Fig.  8.1 3  Venn  diagram. 

The  area  is  weighted 
according  to  the  number  of 
friends  in  each  ego-network 
and  the  intersection  between 
ego-networks  is  related  to 
the  number  of  common  users 

3 

■J 

0 


boxes  like  NetworkX  or  Matplotlib.  In  order  to  create  it,  we  have  used  a  JavaScript 
visualization  library  called  D3.JS. 


8.6  Community  Detection 


In  [18] 


A  community  in  a  network  can  be  seen  as  a  set  of  nodes  of  the  network  that  is  densely 
connected  internally.  The  detection  of  communities  in  a  network  is  a  difficult  task 
since  the  number  and  sizes  of  communities  are  usually  unknown  [3]. 

Several  methods  for  community  detection  have  been  developed.  Here,  we  apply 
one  of  the  methods  to  automatically  extract  communities  from  the  Facebook  network. 
We  import  the  Community  toolbox4 5  which  implements  the  Louvain  method  for 
community  detection.  In  the  code  below,  we  compute  the  best  partition  and  plot  the 
resulting  communities  in  the  whole  Facebook  network  with  different  colors,  as  we 

did  in  box  In  [  17  ] .  The  resulting  graph  is  shown  in  Fig.  8.14. 

/ - \ 

import  community  partition  =  community . best_partition (fb) 

print  "  # 

c  ommun i ties  found:",  max  (partition . values  ()  )  colors2  = 
[partition. get  (node)  for  node  in  fb  .  nodes  (  )  ]  nsize  =  np  . 
array  (  [  v 

for  v  in  degree_cent_fb . values ()] )  nsize  =  500* ( nsize 

min (nsize)  )  /  (max (nsize)  -  min  (nsize)  )  nodes  = 
nx . draw_networkx_nodes ( 
fb ,  pos  =  pos_fb , 

cmap  =  pit . get_cmap ( ' Paired ' )  , 
node_color  =  colors2 , 
node_size  =  nsize, 
with_labels  =  False) 

edges  =  nx  .  draw_ne tworkx_edges  (  fb  ,  pos  =  pos_fb  ,  alpha  =  .1) 

\ _ / 


4https://d3js.org. 

5  http :  //per  so .  cran  s .  org/  ay  naud/communities/ . 
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Fig.  8.1 4  The  Facebook 
network  drawn  using  the 
Spring  layout  and  different 
colors  to  separate  the 
communities  found 


Out [18] :  #  communities  found:  15 

As  can  be  seen,  the  15  communities  found  automatically  are  similar  to  the  10  ego- 
networks  loaded  from  the  dataset  (Fig.  8.12).  However,  some  of  the  10  ego-networks 
are  subdivided  into  several  communities  now.  This  discrepancy  can  be  due  to  the 
fact  that  the  ego-networks  are  manually  annotated  based  on  more  properties  of  the 
nodes,  whereas  communities  are  extracted  based  only  on  the  graph  information. 


8.7  Conclusions 

In  this  chapter,  we  have  introduced  network  analysis  and  a  Python  toolbox  (Net- 
workX)  that  is  useful  for  this  analysis.  We  have  shown  how  network  analysis  allows 
us  to  extract  properties  from  the  data  that  would  be  hard  to  discover  by  other  means. 
Some  of  these  properties  are  basic  concepts  in  social  network  analysis,  such  as 
centrality  measures  which  return  the  importance  of  the  nodes  in  the  network  or  ego- 
networks  which  allows  us  to  study  the  reach  of  the  information  a  node  can  transmit 
or  have  access  to.  The  different  concepts  have  been  practically  illustrated  by  a  prac¬ 
tical  case  dealing  with  a  Facebook  network.  In  this  practical  case,  we  have  resolved 
several  issues,  such  as  finding  the  most  representative  members  of  the  network  in 
terms  of  the  most  “connected”,  the  most  “circulated”,  the  “closest”,  or  the  most 
“accessible”  nodes  to  the  others.  We  have  presented  useful  ways  of  extracting  basic 
properties  of  the  Facebook  network,  and  studying  its  ego-networks  and  communities, 
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as  well  as  comparing  them  quantitatively  and  qualitatively.  We  have  also  proposed 
several  visualizations  of  the  graph  to  represent  several  measures  and  to  emphasize 
the  important  nodes  with  different  meanings. 
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Recommender  Systems 


9.1  Introduction 

In  this  chapter,  we  will  see  what  are  recommender  systems,  how  they  work,  and  how 
they  can  be  implemented.  We  will  also  see  the  different  paradigms  of  recommender 
systems  based  on  the  information  they  use,  as  well  as  the  output  they  produce.  We 
will  consider  typical  questions  that  companies  like  Netflix  or  Amazon  include  in 
their  products:  Which  movie  should  I  rent?  Which  TV  should  I  buy?  and  we  will 
give  some  insights  in  order  to  deal  with  more  complex  questions:  Which  is  the  best 
place  for  me  and  my  family  to  travel  to? 

So,  the  first  question  we  should  answer:  What  is  a  recommender  system?  It  can 
be  defined  as  a  tool  designed  to  interact  with  large  and  complex  information  spaces, 
and  to  provide  information  or  items  that  are  likely  to  be  of  interest  to  the  user,  in  an 
automated  fashion.  We  refer  to  complex  information  space  to  the  set  of  items,  and 
its  characteristics,  which  the  system  recommends  to  the  user,  i.e.,  books,  movies,  or 
city  trips. 

Nowadays,  recommender  systems  are  extremely  common,  and  are  applied  in  a 
large  variety  of  applications.  Perhaps  one  of  the  most  popular  types  are  the  movie 
recommender  systems  in  applications  used  by  companies  such  as  Netflix,  and  the 
music  recommenders  in  Pandora  or  Spotify,  as  well  as  any  kind  of  product  recom¬ 
mendation  from  Amazon.com.  However,  the  truth  is  that  recommender  systems  are 
present  in  a  huge  variety  of  applications,  such  as  movies,  music,  news,  books,  re¬ 
search  papers,  search  queries,  social  tags,  and  products  in  general,  but  they  are  also 
present  in  more  sophisticated  products  where  personalization  is  critical,  like  recom¬ 
mender  systems  for  restaurants,  financial  services,  life  assurance,  online  dating,  and 
Twitter  followers. 

Why  and  When  Do  We  Need  a  Recommender  System? 

In  this  new  era,  where  the  quantity  of  information  is  huge,  recommender  systems 
are  extremely  useful  in  several  domains.  People  are  not  able  to  be  experts  in  all 
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these  domains  in  which  they  are  users,  and  they  do  not  have  enough  time  to  spend 
looking  for  the  perfect  TV  or  book  to  buy.  Particularly,  recommender  systems  are 
really  interesting  when  dealing  with  the  following  issues: 

•  solutions  for  large  amounts  of  good  data; 

•  reduction  of  cognitive  load  on  the  user; 

•  allowing  new  items  to  be  revealed  to  users. 


9.2  How  Do  Recommender  Systems  Work? 

There  are  several  different  ways  to  build  a  recommender  system.  However,  most  of 
them  take  one  of  two  basic  approaches:  content-based  filtering  (CBF)  or  collabora¬ 
tive  filtering  (CF). 


9.2.1  Content-Based  Filtering 

CBF  methods  are  constructed  behind  the  following  paradigm:  “Show  me  more  of 
the  same  what  I’ve  liked”.  So,  this  approach  will  recommend  items  which  are  similar 
to  those  the  user  liked  before  and  the  recommendations  are  based  on  descriptions 
of  items  and  a  profile  of  the  user’s  preferences.  The  computation  of  the  similarity 
between  items  is  the  most  important  part  of  these  methods  and  it  is  based  on  the 
content  of  the  items  themselves.  As  the  content  of  the  item  can  be  very  diverse,  and  it 
usually  depends  on  the  kind  of  items  the  system  recommends,  a  range  of  sophisticated 
algorithms  are  usually  used  to  abstract  features  from  items.  When  dealing  with  textual 
information  such  as  books  or  news,  a  widely  used  algorithm  is  tf-idf  representation. 
The  term  tf-idf  refers  to  frequency-inverse  document  frequency,  it  is  a  numerical 
statistic  that  measures  how  important  a  word  is  to  a  document  in  a  collection  or 
corpus. 

An  interesting  content-based  filtering  system  is  Pandora.  This  music  recom¬ 
mender  system  uses  up  to  400  songs  and  artist  properties  in  order  to  find  similar 
songs  to  recommend  to  the  original  seed.  These  properties  are  a  subset  of  the  fea¬ 
tures  studied  by  musicologists  in  The  Music  Genome  Project  who  describe  a  song 
in  terms  of  its  melody,  harmony,  rhythm,  and  instrumentation  as  well  as  its  form  and 
the  vocal  performance. 


1  http://www.pandora.com/. 
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9.2.2  Collaborative  Filtering 

CF  methods  are  constructed  behind  the  following  paradigm:  “Tell  me  what’s  popular 
among  my  like-minded  users”.  This  is  really  intuitive  paradigm  since  it  is  really 
similar  of  what  people  use  to  do:  ask  or  look  at  the  preferences  of  the  people  they 
trust.  An  important  working  hypothesis  behind  these  kind  of  recommenders  is  that 
similar  users  tend  to  like  similar  items.  In  order  to  do  so,  these  approaches  are  based 
on  collecting  and  analyzing  a  large  number  of  data  related  to  the  behavior,  activities, 
or  tastes  of  users,  and  predicting  what  users  will  like  based  on  their  similarity  to  other 
users.  One  of  the  main  advantages  of  this  type  of  system  is  that  it  does  not  need  to 
“understand”  what  the  item  it  recommends  is. 

Nowadays,  these  methods  are  extremely  popular  because  of  the  simplicity  and 
the  large  amount  of  data  available  from  users.  The  main  drawbacks  of  this  kind  of 
method  is  the  need  for  a  user  community,  as  well  as  the  cold-start  effect  for  new 
users  in  the  community.  The  cold-start  problem  appears  when  the  system  cannot 
draw  any,  or  an  optimal,  inference  or  recommendation  for  the  users  (or  items)  since 
it  has  not  yet  obtained  the  sufficient  information  of  them. 

CF  can  be  of  two  types:  user-based  or  item-based. 

•  User-based  CF  works  like  this:  Find  similar  users  to  me  and  recommend  what  they 
liked.  In  this  method,  given  a  user,  U ,  we  first  find  a  set  of  other  users,  D,  whose 
ratings  are  similar  to  the  ratings  of  U  and  then  we  calculate  a  prediction  for  U. 

•  Item-based  CF  works  like  this:  Find  similar  items  to  those  that  I  previously  liked. 
In  item-based  CF,  we  first  build  an  item-item  matrix  that  determines  relationships 
between  pairs  of  items;  then  using  this  matrix  and  data  on  the  current  user  U , 
we  infer  the  user’s  taste.  Typically,  this  approach  is  used  in  the  domain:  people 
who  buy  v  also  buy  y.  This  is  a  really  popular  approach  used  by  companies  like 
Amazon.  Moreover,  one  of  the  advantages  of  this  approach  is  that  items  usually 
do  not  change  much,  so  its  similarities  can  be  computed  offline. 


9.2.3  Hybrid  Recommenders 

Hybrid  approaches  can  be  implemented  in  several  ways:  by  making  content-based 
and  collaborative  predictions  separately  and  then  combining  them;  by  adding  content- 
based  capabilities  to  a  collaborative  approach  (and  vice  versa);  or  by  unifying  the 
approaches  into  one  model. 


9.3  Modeling  User  Preferences 

Both,  CBF  and  CF  recommender  systems,  require  to  understand  the  user  prefer¬ 
ences.  Understanding  how  to  model  the  user  preference  is  a  critical  step  due  to  the 
variety  of  sources.  It  is  not  the  same  when  we  deal  with  applications  like  the  movie 
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recommender  from  Netflix,  where  the  users  rank  the  movies  with  1  to  5  stars;  or 
as  dealing  with  any  product  recommender  system  from  Amazon,  where  usually  the 
tracking  information  of  the  purchases  is  used.  In  this  case,  three  values  can  be  used: 
0  -  not  bought;  1  -  viewed;  2  -  bought. 

The  most  common  types  of  labels  used  to  estimate  the  user  preferences  are: 

•  Boolean  expressions  (is  bought?;  is  viewed?) 

•  Numerical  expressions  (e.g.,  star  ranking) 

•  Up-Down  expressions  (e.g.,  like,  neutral,  or  dislike) 

•  Weighted  value  expressions  (e.g.,  number  of  reproductions  or  clicks) 

In  the  following  sections  of  this  chapter,  we  only  consider  the  numerical  expression 
described  as  stars  on  the  scale  of  1  to  5. 


9.4  Evaluating  Recommenders 

The  evaluation  of  the  recommender  systems  is  another  important  step  in  order  to 
assess  the  effectiveness  of  the  method.  When  dealing  with  numerical  labels,  as  the 
5-star  ratings,  the  most  common  way  to  validate  a  recommender  system  is  based 
on  their  prediction  value,  i.e.,  the  capacity  to  predict  the  user’s  choices.  Standard 
functions  such  as  root  mean  square  error  (RMSE),  precision ,  recall ,  or  ROC/cost 
curves  have  been  extensively  used. 

However,  there  are  several  other  ways  to  evaluate  the  systems.  It  is  because  metrics 
are  entirely  relevant  to  point  of  view  of  the  person  who  has  to  evaluate  it.  Imagine 
the  following  three  persons:  (a)  a  marketing  guy;  (b)  a  technical  system  designer; 
and  (c)  a  final  user.  It  is  clear  that  what  is  relevant  for  all  of  them  is  not  the  same. 
For  a  marketing  guy,  what  is  usually  important  is  how  the  system  helps  to  push  the 
product,  for  the  technical  system  designer  is  how  efficient  is  the  algorithm,  and  for 
the  final  user  is  if  the  system  gives  him  good,  or  mostly  cool,  results.  In  the  literature 
we  can  see  two  main  typologies:  offline  and  online  evaluation. 

We  refer  to  evaluation  as  offline  when  a  set  of  labeled  data  is  obtained  and  then 
divided  into  two  sets:  a  training  set  and  a  test  set.  The  training  set  is  used  to  create  the 
model  and  adjust  all  the  parameters;  while  the  test  set  is  used  to  determine  selected 
evaluation  metrics.  As  mentioned  above,  standard  metrics  such  as  RMSE,  preci¬ 
sion,  and  recall  are  extensively  used,  but  recently  other  indirect  functions  have  also 
started  to  be  widely  considered.  Examples  of  these:  diversity,  novelty,  coverage,  cold- 
start,  or  serendipity,  the  latter  is  a  quite  popular  metric  that  evaluates  how  surprising 
the  recommendations  are.  For  further  discussion  of  this  field,  the  reader  is  referred 
to  [1]. 
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We  refer  to  evaluation  as  online  when  a  set  of  tools  is  used  that  allows  us  to  look  at 
the  interactions  of  users  with  the  system.  The  most  common  online  technique  is  called 
A-B  testing  and  has  the  benefit  of  allowing  evaluation  of  the  system  at  the  same  time 
as  users  are  learning,  buying,  or  playing  with  the  recommender  system.  This  brings 
the  evaluation  closer  to  the  actual  working  of  the  system  and  makes  it  really  effective 
when  the  purpose  of  the  system  is  to  change  or  influence  the  behavior  of  users.  In 
order  to  evaluate  the  test,  we  are  interested  in  measuring  how  user  behavior  changes 
when  the  user  is  interacting  with  different  recommender  systems.  Let  us  give  an 
example:  imagine  we  want  to  develop  a  music  recommender  system  like  Pandora, 
where  your  final  goal  is  none  other  than  for  users  to  love  your  intelligent  music 
station  and  spend  more  time  listening  to  it.  In  such  a  situation,  offline  metrics  like 
RMSE  are  not  good  enough.  In  this  case,  we  are  particularly  interested  in  evaluation 
of  the  global  goal  of  the  recommender  system  as  it  is  the  long-term  profit  or  user 
retention. 


9.5  Practical  Case 

In  this  section,  we  will  play  with  a  real  dataset  to  implement  a  movie  recommender 
system.  We  will  work  with  a  user-based  collaborative  system  with  the  MovieLens 
dataset. 


9.5.1  MovieLens  Dataset 

MovieLens  datasets  are  a  collection  of  movie  ratings  produced  by  hundreds  of  users 
collected  by  the  GroupLens  Research  Project  at  the  University  of  Minnesota  and 
released  into  the  public  domain.  Several  versions  of  this  dataset  can  be  found  at  the 
GroupLens  site.  Figure  9.1  shows  a  capture  of  this  website. 

Although  performance  on  bigger  dataset  is  expected  to  be  better,  we  will  work 
with  the  smallest  dataset:  MovieLens  100K  Dataset.  Working  with  this  lite  version 
has  the  benefit  of  less  computational  costs,  while  we  will  also  get  the  basic  skills 
required  on  user-based  recommender  systems. 

Once  you  have  downloaded  and  unzipped  the  file  into  a  directory,  you  can  create 
a  Pandas  DataFrame  with  the  following  code: 


2  http :  //grouplens .  org/datasets/movielens/ . 
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greupiens  about  da  taMti  publications  blog 


Movie  Lens 


GroupLens  Research  has  collected  and  made  avertable  rating  dans  sets  rrom  me  MovaeLens  tveb  Site 

me  data  sets  v^ere  deflected  over  various  periods  or  time,  depending  on  ihe  size  ol  the 
set.  Before  using  these  data  sets,  please  review  meir  readme  files  tor  me  usage  licenses  and  other  details. 

Help  our  research  lab:  Please  about  the  MowieLens  datasets 


MovieLens  10QK.  Dataset 

Stable  bench  nrtajrk  dataset.  100,000  ratings  Irom  10CO  users  on  \  700  movies.  Released  4/1098. 

*  README-  tort 

*  (Size;  SMB. 

*  ■-  n  PX  of  T  7  powjf  ■ :  S 

PenmaJirdc  h;;o  y/qroupiens.o^^  i  QQfr/ 


Movie  Lens  im  Dataset 

Stable  benchmark  dataset- 1  nniir.on  ratings  from  ■6000  users  on  4000  mov.es-  Released  2/2003. 

-  BEAQMLM 

*  {size:  6  MB, 

Perm  slink; 


MovieLens  lOM  Dataset 

Stable  benchmark  dataset.  10  million  ratings  and  100.000  tag  applications  applied  to  1 0,000  movies  by  72.000  users. 
Released  1/2009. 

*  README  n'j-'-i 

*  (size:  &3  mb. 


Fig.  9.1  Grouplens  website 


In  [1]  : 


f  \ 

#  Load  user  data 

u_cols  =  [ 

' user_id ' ,  'age',  'sex', 

'occupation',  ' zip_code ' 

] 

users  =  pd. read_csv ( ' files/ch09 /ml -lOOk/u. user ' , 

s ep  =  '  |  '  , 
names  =  u_c  o 1 s ) 

#  Load  movie  data 

r_c  o 1 s  =  [ 

' user_id  '  ,  ' movie_id  '  , 

' rating '  ,  ' unix_timest amp  ' 

] 

ratings  =  pd  .  read_csv ('  files / ch0  9 /  ml -100k/u.data'  , 

s ep  = '  \ t  '  , 
names  =  r_c  o 1 s ) 

#  The  movie  file  contains  columns  indicating  the  genres  of 

the  movie 

#  We  will  only  load  the  first  three  columns  of  the  file  with 

u  s  e  c  o  1  s 

V _ / 


9.5  Practical  Case 


171 


In  [1]  : 


Out  [  1  ]  : 


The  DB  has  100000  ratings 

The  DB  has  943  different  users 

The  DB  has  1682  different  items 


user_id  title  movie_id  rating 


0  196  Kolya  (1996)  242 

1  305  Kolya  (1996)  242 

2  6  Kolya  (1996)  242 

3  234  Kolya  (1996)  242 

4  63  Kolya  (1996)  242 


3 
5 

4 
4 
3 


If  you  explore  the  dataset  in  detail,  you  will  see  that  it  consists  of: 


•  100,000  ratings  from  943  users  of  1682  movies.  Ratings  are  from  1  to  5. 

•  Each  user  has  rated  at  least  20  movies. 

•  Simple  demographic  info  for  the  users  (age,  gender,  occupation,  zip). 


9.5.2  User-Based  Collaborative  Filtering 

In  order  to  create  a  user-based  collaborative  recommender  system  we  must  define:  (1) 
a  prediction  function,  (2)  a  user  similarity  function,  and  (3)  an  evaluation  function. 

Prediction  Function 

The  prediction  function  behind  the  user-based  CF  will  be  based  on  the  movie  ratings 
from  similar  users.  So,  in  order  to  recommend  a  movie,  p ,  from  a  set  of  movies,  P, 
to  a  given  user,  a ,  we  first  need  to  see  the  set  of  users,  B ,  who  have  already  seen  p. 
Then,  we  need  to  see  the  taste  similarity  between  these  users  in  B  and  user  a.  The 
most  simple  prediction  function  for  a  user  a  and  movie  p  can  be  defined  as  follows: 

beBsim(a,b)(rb'p) 

Efceg  sim(a,  b) 


pred(a,  p)  = 


E 


(9.1) 
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Table  9.1  Recommender  System 


Critic 

sim(a,b) 

Rating  movie  1:  rb,pi 

sim(a,  b)(rb,pi) 

Paul 

0.99 

3 

2.97 

Alice 

0.38 

3 

1.14 

Marc 

0.89 

4.5 

4.0 

Anne 

0.92 

3 

2.77 

J2b&Nsim(a’  b)(rb,p) 

10.87 

HbcNsim(a’  b) 

3.18 

pred(a,  p ) 

3.41 

where  sim(a,  b)  is  the  similarity  between  user  a  and  user  b ,  B  is  the  set  of  users  in 
the  dataset  that  have  already  seen  p  and  r is  the  rating  of  p  by  b. 

Let  us  give  an  example  (see  Table  9.1).  Imagine  the  system  can  only  recommend 
one  movie,  since  the  rest  have  already  been  seen  by  the  user.  So,  we  only  want  to 
estimate  the  score  corresponding  to  that  movie.  The  movie  has  been  seen  by  Paul, 
Alice,  Marc,  and  Anne  and  scored  3,3,4,  and  3,  respectively.  Similarity  between  user 
a  and  Paul,  Alice,  Marc,  and  Anne  has  been  computed  “somehow”  (we  will  see  later 
how  we  can  compute  it)  and  the  values  are  0.99,  0.38,  0.89,  and  0.92,  respectively.  If 
we  follow  the  previous  equation,  the  estimated  score  is  3.41,  as  seen  in  Table 9.1. 

User  Similarity  Function 

The  computation  of  the  similarity  between  users  is  one  of  the  most  critical  steps  in 
the  CF  algorithms.  The  basic  idea  behind  the  similarity  computation  between  two 
users  a  and  b  is  that  we  can  first  isolate  the  set  P  of  items  rated  by  both  users,  and 
then  apply  a  similarity  computation  technique  to  determine  the  similarity. 

The  set  of  coimmon_movies  can  be  obtained  with  the  following  code: 

#  dataframe  with  the  data  from  user  1 

df_usrl  =  data_train [ data_train . user_id  —  1] 

#  dataframe  with  the  data  from  user  2 

df_usr2  =  data_train [ data_train . user_id  ==  6] 

#  We  first  compute  the  set  of  common  movies 
common_mov  =  set  (  df_usrl  . movie_id)  .  i n t e r s e c t i on  ( 

df_usr2 . movie_id) 

print  " \nNumber  of  common  movies", 
len ( common_mov) 

V _ / 
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/ - \ 

#  Sub  -  da t a f r ame  with  only  the  common  movies 
mask  =  (data_user_l.movie_id.isin(c ommon_mo vi e  s )  ) 
data_user_l  =  data_user_l [mask] 

print  data_user_l [ [ 'title' ,  'rating' ] ] . head ( ) 

mask  =  ( data_user_2  . movie_id .  isin ( c ommon_mo vi e  s )  ) 

data_user_2  =  data_user_2 [mask] 

print  data_user_2 [ [ 'title' ,  'rating' ] ] . head ( ) 

\ _ / 


Out  [  2  ]  : 


Number  of  common  movies  11 
Movies  User  1 

title  rating 

14  Kolya  (1996)  5 

417  Shall  We  Dance?  (1996)  4 

1306  Truth  About  Cats  &  Dogs,  The  (1996)  5 

1618  Birdcage,  The  (1996)  4 

3479  Men  in  Black  (1997)  4 

Movies  User  2 

title  rating 

32  Kolya  (1996)  5 

424  Shall  We  Dance?  (1996)  5 

1336  Truth  About  Cats  &  Dogs,  The  (1996)  4 

1648  Birdcage,  The  (1996)  4 

3510  Men  in  Black  (1997)  4 


Once  the  set  of  ratings  for  all  movies  common  to  the  two  users  has  been  obtained, 
we  can  compute  the  user  similarity.  Some  of  the  most  common  similarity  functions 
used  in  CF  methods  are  as  follows: 

Euclidean  distance: 


sim(a,  b )  = 


1 


Pearson  correlation: 


sim(a,  b )  = 


1  +  ( ra,p  rb,p )2 

^2pep(ra,p  ~  ra)(rb,p  ~  rb ) 

yj^jpeP^^p  ~  ra^'I2pep(rb,p  ~  rb) 


where  ra  and  rp  are  the  mean  ratings  of  users  a  and  b. 

Cosine  distance: 

a  •  b 

sim(a,  b)  =  - 

\a\  • \b\ 


(9.2) 


(9.3) 


(9.4) 


Now,  the  question:  Which  function  should  we  use?  The  answer  is  that  there  is  no 
fixed  recipe;  but  there  are  some  issues  we  can  take  into  account  when  choosing  the 
proper  similarity  function.  On  the  one  hand,  Pearson  correlation  usually  works  better 
than  Euclidean  distance  since  it  is  based  more  on  the  ranking  than  on  the  values.  So, 
two  users  who  usually  like  more  the  same  set  of  items,  although  their  rating  is  on 
different  scales,  will  come  out  as  similar  users  with  Pearson  correlation  but  not  with 
Euclidean  distance.  On  the  other  hand,  when  dealing  with  binary/unary  data,  i.e., 


In  [3 


In  [  4 
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like  versus  not  like  or  buy  versus  not  buy,  instead  of  scalar  or 
cosine  distance  is  usually  used. 

Let  us  define  the  Euclidean  and  Pearson  functions: 

real  data  like  ratings, 

r 

from  s c ipy . spa t i a 1 . di s t anc e 

import 

Euc 1 i dean 

#  Similarity  based  on  Euclid 

ean  di s 

tance  for  u 

sers  1-2 

def  S imEuc 1 id ( df , User 1 , User2 

, mi n_c  o 

mmon_i t  ems  = 

10  )  : 

#  GET  MOVIES  OF  USERl 

mov_ul  =  df [df [ ' user_id ' 

]  ==  Us 

er  1  ] 

#  GET  MOVIES  OF  USER2 

mov_u2  =  df [df [ ' user_id ' 

]  ==  Us 

er2  ] 

#  FIND  SHARED  FILMS 

rep  =  pd . merge (mov_ul ,  m 

ov_u2  , 

on  =  ' movi e 

_id  '  ) 

i f  len  ( rep )  =  =  0  : 

return  0 

if (len(rep)  <  min_common 

_i t ems ) 

I 

return  0 

return  1.0  /  (1.0+euclid 

ean ( rep 

['rat i ng_x ' 

]  , 

V 

rep 

['rat ing_y  ' 

]  )  ) 

_ 

f - \ 

from  scipy. stats  import  pearsonr 

#  Similarity  based  on  Pearson  correlation  for  user  1-2 
def  SimPearson  ( df  ,  Userl  ,  User2  ,  min_c ommon_i t ems  =  10)  : 

#  GET  MOVIES  OF  USERl 

mov_ul  =  df [df [ ' user_id ' ]  ==  Userl  ] 

#  GET  MOVIES  OF  USER2 

mov_u2  =  df  [df  [  '  user_id  '  ]  ==  User2  ] 

#  FIND  SHARED  FILMS 

rep  =  pd  .  merge  (  mov_ul  ,  mov_u2  ,  on  =  '  movie_id  '  ) 

i  f  1  en  (  rep  )  =  =  0  : 

return  0 

if  ( len  (rep)  <  mi n_c  ommon_i terns)  : 

return  0 

return  pearsonr  ( rep  [  ' rat ing_x '  ]  ,  rep  ['  rat ing_y  '  ]  )  [  0 ] 

v _ y 


Figure  9.2  shows  the  correlation  plots  for  user  1  versus  user  8  and  user  1  versus 
user  3 1 .  Each  point  in  the  plots  corresponds  to  a  different  set  of  ratings  from  the  two 
users  of  the  same  movies.  The  bigger  the  dot,  the  larger  the  set  of  movies  rated  with 
the  corresponding  values.  We  can  observe  in  these  plots  that  ratings  from  user  1  are 
more  correlated  with  ratings  from  user  8  than  from  the  user  3 1 .  However,  as  we  can 
observe  in  the  following  outputs,  the  Euclidean  similarity  between  user  1  and  user 
31  is  closer  than  between  user  1  and  user  8. 


print 

"Euclidean  similarity 

" , SimEuclid (data_train 

,  1  , 

8) 

N 

print 

"Pearson  similarity", 

SimPearson (data_train , 

1  , 

8) 

print 

"Euclidean  similarity 

" , SimEuclid (data_train 

,  1  , 

3  1  ) 

print 

"Pearson  similarity", 

SimPearson (data_train , 

1  , 

3  1  ) 

_ y 

In  [  5  ]  : 
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Out  [  5  ]  : 


In  [  6  ]  : 
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Rating  User  1 

Rating  User  1 

(a)  User  1  vs.  8 

(b)  User  1  vs.  31 

Fig.  9.2  Similarity  between  users 

Euclidean  similarity  0.195194101601 
Pearson  similarity  0.773097845465 

Euclidean  similarity  0.240253073352 
Pearson  similarity  0.272165526976 


Evaluation 

In  order  to  validate  the  system,  we  will  divide  the  dataset  into  two  different  sets: 
one  called  X_train  containing  80%  of  the  data  from  each  user;  and  another  called 
X_test,  with  the  remaining  20%  of  the  data  from  each  user.  In  the  following  code 
we  create  a  function  as  s  ign_to_se  t  that  creates  a  new  column  in  the  DataFrame 
indicating  which  sample  it  belongs  to. 

f  \ 

def  a s s i gn_t o_s e t (df ) : 

sampled_ids  =  np . random . cho i c e ( 
df .  index  , 

size  =  np.int64 (np. ceil (df. index. size  *  0.2)), 

replace  =  Fal se ) 

df . ix [ sampl ed_ids ,  ' for_testing ' ]  =  True 

return  df 

data [ ' f or_testing ' ]  =  False 

grouped  =  data . groupby ( ' user_id ' ,  group_keys  =  False) 

. apply ( ass i gn_t  o_s  e  t ) 

X_train  =  data [ grouped . f or_testing  ==  False] 

X_test  =  data [ grouped . f or_tes t ing  ==  True] 

v _ 7 


The  resulting  X_train  and  X_tes  t  sets  have  79619  and  20381  ratings,  respec¬ 
tively. 

Once  the  data  is  divided  in  these  sets,  we  can  build  a  model  with  the  training  set 
and  evaluate  its  performance  using  the  test  set.  In  our  case,  the  evaluation  will  be 
performed  using  the  standard  RMSE: 


RMSE  = 
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In  [7]  : 


where  y  is  the  real  rating  and  y  is  the  predicted  rating. 

f - \ 

de  f  comput e_rms  e  ( y_pr ed ,  y_t rue )  : 

"""  Compute  Root  Mean  Squared  Error.  """ 

return  np  .  sqr t  (  np  .  mean  (  np  .  power  (  y_pr ed  -  y_t rue  ,  2  )  )  ) 

\ _ y 


Collaborative  Filtering  Class 


In  [8]  : 


We  can  define  our  recommender  system  with  a  Python  class.  This  class  consists  of 
a  constructor  and  two  methods:  fit  and  predict.  In  the  fit  method  the  user’s 
similarities  are  computed  and  stored  into  a  Python  dictionary.  This  is  a  really  simple 
method  but  quite  expensive  in  terms  of  computation  when  dealing  with  a  large  dataset. 
We  decided  to  show  one  of  the  most  basic  schemes  in  order  to  implement  it.  More 
complex  algorithms  can  be  used  in  order  to  improve  the  computations  cost.  Moreover, 
online  strategies  can  be  used  when  dealing  with  a  really  dynamic  problems.  In  the 
predict  the  score  for  a  movie  and  a  user  is  estimated. 


class 


"  "  "  C F  using  a  custom  sim (u, u  '  )  .  "  "  " 

def  _ init _ (self,  df ,  similarity  =  SimPearson) : 

" " "  Constructor  " " " 

self . sim_method  =  similarity 

self.df  =  df 

self. sim  =  pd. DataFrame ( 

np  .  sum  (  [  0  ]  )  ,  columns  =  df  .  user__id  .  unique  (  )  , 
index  =  df.user_id. unique ( ) ) 


def  fit  (self)  : 

"""  Prepare  data  structures  for  estimation. 

Similarity  matrix  for  users  """ 
allUsers  =  set ( self . df [ ' user_id ' ] ) 
self. sim  =  {} 

for  personl  in  allUsers: 

sel f  . s im .  se t de f aul t  ( personl  ,  {}) 

a  =  self . df [ 

self . df  [  ' user_id '  ]  ==  personl ]  [  [  'movie_id '  ] 

] 

data_reduced  =  pd . merge  ( s e  1  f  . df  ,  a, 

on  =  ' movie_id ' ) 

for  person2  in  allUsers: 

#  Avoid  our-self 

if  personl  ==  person2 :  continue 

self. sim. setdefault (person2 ,  {}) 

if ( self . sim [person2 ] . has_key (personl ) ) : 

continue  #  since  symmetric  matrix 
sim  =  self . sim_method ( data_reduced , 

personl  , 
person2 ) 

i f  ( s im  <  0 )  : 

sel f . sim [ personl ] [person2]  =  0 

self . sim [person2 ] [personl]  =  0 

else: 

sel f . sim [ personl ] [person2]  =  sim 
self . sim [person2 ] [personl]  =  sim 


def  predict (self ,  user_id ,  movie_id) : 
totals  =  {} 

users  =  sel f . df [ sel f . df [ ' movi e_id ' ]  ==  movie_id] 


> 
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In  [11] : 


In  [9]  : 


In  [10] : 


Out [ 10 ] 


rating_num,  rating_den  =  0.0,  0.0 

allUsers  =  set  (users  [  ' user_id '  ]  ) 
for  other  in  allUsers: 

if  user_id  ==  other:  continue 
rating_num  += 

self . sim [user_id] [other]  *  float (users [users 
[ ' user_id ' ]  ==  other] ['rating']) 

rating_den  +=  self . sim [user_id] [other] 
if  rating_den  ==  0: 

if  s e 1 f  . df . r a t i ng  [ s e 1 f  . df  [  ' mo vi e_i d '  ]  =  = 

movie_id] . mean ( )  >  0: 

#  Mean  movie  rating  if  there  is  no  similar 

for  the  computation 

return  s  e  1  f  .  df  .  r  a  t  i  ng  [  s  e  1  f  .  df  [  '  mo  vi  e_i  d  '  ]  =  = 

movie_id] . mean ( ) 

else: 

#  else  mean  user  rating 

return  self  .  df  .  rating  [ self . df  [  ' user_id '  ]  =  = 

user_id] . mean ( ) 
return  r a t i ng_num / r a t i ng_den 


For  the  evaluation  of  the  system  we  define  a  function  called  evaluate.  This 
function  estimates  the  score  for  all  items  in  the  test  set  (X_test)  and  compares 
them  with  the  real  values  using  the  RMSE. 


/ - \ 

def  evaluate  ( fit_f  ,  train  ,  test)  : 

"""  RMSE-based  predictive  performance  evaluation  with 
pandas .  " " " 

ids_to_es t imate  =  zip ( test . user_id ,  test .movie_id) 
estimated  =  np . array ( [ fit_f (u,  i) 

if  u 

in  train . user_id 

else  3 
for  ( u ,  i  ) 

in  ids_to_est imate  ]  ) 
real  =  t e s t . r a t i ng . va lue s 
return  compute_rmse ( estimated ,  real) 

V _ / 


Now,  the  system  can  be  executed  with  the  following  lines: 


print 

'  RMSE 

for  Collaborative  Recommender: ' , 

\ 

print 

'  %  s  ' 

%  e va 1 ua t e ( r e c o . f i t ,  data_train,  data_test) 

_ y 

RMSE  for  Collaborative  Recommender:  1.00468945461 


As  we  can  see,  the  obtained  RMSE  for  this  first  basic  recommender  system  is 
1.004.  Sure,  that  this  result  could  be  improved  with  a  bigger  dataset,  but  let  us  think 
of  how  we  can  improve  it  with  just  few  tricks: 

Trick  1:  Since  humans  do  not  usually  act  the  same  as  critics,  i.e.,  some  people 
usually  rank  movies  higher  or  lower  than  others,  this  prediction  function  can  be  easily 
improved  by  taking  into  account  the  user  mean  as  follows: 


pred(a,  p)  =  ra  + 


I IbeB  sim(a,  b)  *  {n,p 
T,bsB  sim(a,  b) 


where  ra  and  r/,  are  the  mean  rating  of  user  a  and  b. 


(9.6) 
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Table  9.2  Recommender  system  using  mean  user  ratings 


Critic 

sim(a,b) 

Mean  ratings: 
rb 

Rating 
moviel:  rb^pi 

sim(a,  b)  * 

( rb,p\ ) 

Paul 

0.99 

4.3 

3 

-1.28 

Alice 

0.38 

2.73 

3 

0.1 

Marc 

0.89 

3.12 

4.5 

1.22 

Anne 

0.92 

3.98 

3 

-0.9 

J2beN  sim(a,  b)  *  (rbiP  -  fb) 

-1.13 

HbeNsim(a ’  b) 

3.18 

pred(a,  p) 

3.14 

Let  us  see  an  example:  Prediction  for  the  user  “a”  with  Fa  =3.5  (Table 9.2) 

If  we  modify  the  recommender  system  using  Eq.  (9.6),  the  RMSE  obtained  is  the 
following: 

Out [ 11] : RMSE  for  Collaborative  Recommender:  0.950086206741 

Trick  2:  One  of  the  most  critical  steps  with  this  kind  of  recommender  system  is 
the  user  similarity  computation.  If  two  users  have  very  few  items  in  common,  let  us 
imagine  that  there  is  only  one,  and  the  rating  is  the  same,  the  user  similarity  will  be 
really  high;  however,  the  confidence  is  really  small.  In  order  to  solve  this  problem 
we  can  modify  the  similarity  function  as  follows: 

min(K,  \Pab\) 

*  - 

K 

where  |  Pab  |  is  the  number  of  common  items  shared  by  user  a  and  user  b ,  and  K  is  the 
minimum  number  of  common  items  in  order  not  to  penalize  the  similarity  function. 
In  the  next  code,  we  define  an  update  version  of  the  similarity  function  called 

simPersonCorrected  that  follows  the  Eq.9.7. 

/  \ 

def  SimPearsonCorrec ted  (  df ,  Userl  ,  User2  , 

mi n_c ommon_i t ems  =  1, 

pr e f_c ommon_i terns  =  20) : 

"""  RMSE-based  predictive  performance  evaluation  with 
pandas .  " " " 

#  GET  MOVIES  OF  USERl 


m_ 

_user 

1  =  df 

[ 

df  [ 

' u  s  er_ 

i 

d 

'  ] 

=  =  Us 

er  1  ] 

# 

GET 

MOVIES 

OF 

USER2 

m_ 

_user 

2  =  df 

[ 

df  [ 

' u  s  er_ 

i 

d 

'  ] 

=  =  Us 

er  2  ] 

# 

FIND 

SHARED 

FI 

LMS 

rep  = 

pd . me  r 

g 

e  ( m 

_us  e  r 1 

/ 

m_ 

user 2  , 

on  = 

if  1  en 

(  rep  ) 

= 

=  0 

; 

re 

turn  0 

if ( 1 en 

(  rep  ) 

< 

mi 

n_c  omm 

o 

n 

_  i 

terns )  : 

return  0 


_ _ 


(9.7) 


new_sim(a,  b)  =  sim(a,  b) 
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/ - \ 

res  =  pearsonr  (rep  [  '  rat  ing_x  '  ]  ,  rep  ['  rat  ing_y  '  ]  )  [  0  ] 
res  =  res  *  min (pre f_common_i terns ,  len(rep) ) 
res  =  res  /  pr e f _c ommon_i t ems 
i f  ( i snan  ( res )  )  : 

return  0 
return  res 

reco4  =  C o 1 1 abo r a t i veF i 1 t e r i ng3  ( 
data_t rain  , 

similarity  =  S i mP e a r s onC o r r e c t e d ) 
r e c o 4  .  learn  (  ) 

print  ' RMSE  for  Collaborative  Recommender: ' , 

print  '  %s'  %  evaluate ( reco4 . fit ,  data_train,  data_test) 

V _ / 


Out [ 12 ]:  RMSE  for  Collaborative  Recommender:  0.930811091922 

As  it  can  be  seen,  with  this  small  modification  the  RMSE  error  has  decreased 
from  1.0  to  0.93. 


9.6  Conclusions 

In  this  chapter,  we  have  introduced  what  are  recommender  systems,  how  they  work, 
and  how  they  can  be  implemented  in  Python.  We  have  seen  that  there  are  different 
types  of  recommender  systems  based  on  the  information  they  use,  as  well  as  the 
output  they  produce.  We  have  introduced  content-based  recommender  systems  and 
collaborative  recommender  systems;  and  we  have  seen  the  importance  of  defining 
the  similarity  function  between  items  and  users. 

We  have  learned  how  recommender  system  can  be  implemented  in  Python  in  order 
to  answer  questions  such  as  which  movie  should  I  see?  We  have  also  discussed  how 
recommender  system  should  be  evaluated,  and  several  online  and  offline  metrics. 

Finally,  we  have  worked  with  a  publicly  available  dataset  from  GroupLens  in 
order  to  implement  and  evaluate  a  collaborative  recommendation  system  for  movie 
recommendations . 
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Statistical  Natural  Language 
Processing  for  Sentiment  Analysis 


10.1  Introduction 

In  this  chapter,  we  will  perform  sentiment  analysis  from  text  data.  The  term  sentiment 
analysis  (or  opinion  mining)  refers  to  the  analysis  from  data  of  the  attitude  of  the 
subject  with  respect  to  a  particular  topic.  This  attitude  can  be  a  judgment  (appraisal 
theory),  an  affective  state,  or  the  intended  emotional  communication. 

Generally,  sentiment  analysis  is  performed  based  on  the  processing  of  natural 
language,  the  analysis  of  text  and  computational  linguistics.  Although  data  can  come 
from  different  data  sources,  in  this  chapter  we  will  analyze  sentiment  in  text  data, 
using  two  particular  text  data  examples:  one  from  film  critics,  where  the  text  is  highly 
structured  and  maintains  text  semantics;  and  another  example  coming  from  social 
networks  (tweets  in  this  case),  where  the  text  can  show  a  lack  of  structure  and  users 
may  use  (and  abuse!)  text  abbreviations. 

In  the  following  sections,  we  will  review  some  basic  mechanisms  required  to 
perform  sentiment  analysis.  In  particular,  we  will  analyze  the  steps  required  for 
data  cleaning  (that  is,  removing  irrelevant  text  items  not  associated  with  sentiment 
information),  producing  a  general  representation  of  the  text,  and  performing  some 
statistical  inference  on  the  text  represented  to  determine  positive  and  negative  senti¬ 
ments. 

Although  the  scope  of  sentiment  analysis  may  introduce  many  aspects  to  be  ana¬ 
lyzed,  in  this  chapter  and  for  simplicity,  we  will  analyze  binary  sentiment  analysis 
categorization  problems.  We  will  thus  basically  learn  to  classify  positive  against 
negative  opinions  from  text  data.  The  scope  of  sentiment  analysis  is  broader,  and  it 
includes  many  aspects  that  make  analysis  of  sentiments  a  challenging  task.  Some 
interesting  open  issues  in  this  topic  are  as  follows: 

•  Identification  of  sarcasm:  sometimes  without  knowing  the  personality  of  the  per¬ 
son,  you  do  not  know  whether  “bad”  means  bad  or  good. 
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•  Lack  of  text  structure:  in  the  case  of  Twitter,  for  example,  it  may  contain  abbre¬ 
viations,  and  there  may  be  a  lack  of  capitals,  poor  spelling,  poor  punctuation,  and 
poor  grammar,  all  of  which  make  it  difficult  to  analyze  the  text. 

•  Many  possible  sentiment  categories  and  degrees:  positive  and  negative  is  a  simple 
analysis,  one  would  like  to  identify  the  amount  of  hate  there  is  inside  the  opinion, 
how  much  happiness,  how  much  sadness,  etc. 

•  Identification  of  the  object  of  analysis:  many  concepts  can  appear  in  text,  and  how 
to  detect  the  object  that  the  opinion  is  positive  for  and  the  object  that  the  opinion  is 
negative  for  is  an  open  issue.  For  example,  if  you  say  “She  won  him!”,  this  means 
a  positive  sentiment  for  her  and  a  negative  sentiment  for  him,  at  the  same  time. 

•  Subjective  text:  another  open  challenge  is  how  to  analyze  very  subjective  sentences 
or  paragraphs.  Sometimes,  even  for  humans  it  is  very  hard  to  agree  on  the  sentiment 
of  these  highly  subjective  texts. 


10.2  Data  Cleaning 

In  order  to  perform  sentiment  analysis,  first  we  need  to  deal  with  some  processing 
steps  on  the  data.  Next,  we  will  apply  the  different  steps  on  simple  “toy”  sentences 
to  understand  better  each  one.  Later,  we  will  perform  the  whole  process  on  larger 
datasets. 

Given  the  input  text  data  in  cell  [  1  ] ,  the  main  task  of  data  cleaning  is  to  remove 
those  characters  considered  as  noise  in  the  data  mining  process.  For  instance,  comma 
or  colon  characters.  Of  course,  in  each  particular  data  mining  problem  different  char¬ 
acters  can  be  considered  as  noise,  depending  on  the  final  objective  of  the  analysis.  In 
our  case,  we  are  going  to  consider  that  all  punctuation  characters  should  be  removed, 
including  other  non-conventional  symbols.  In  order  to  perform  the  data  cleaning  pro¬ 
cess  and  posterior  text  representation  and  analysis  we  will  use  the  Natural  Language 
Toolkit  (NLTK)  library  for  the  examples  in  this  chapter. 


\ 

raw_docs  =  ["Here  are  some  very 

s imp 1 e  basic 

sentences  .  "  , 

"They  won't  be  very 

interesting  , 

I'm  afraid.", 

"The  point  of  these 

examples  is 

to  _learn  how 

basic  text  \ 

cleaning  works_  on  * 

very  s impl e  * 

data  .  "  ] 

V 

_ 

The  first  step  consists  of  defining  a  list  with  all  word- vectors  in  the  text.  NLTK 
makes  it  easy  to  convert  documents-as-strings  into  word-vectors,  a  process  called 
tokenizing.  See  the  example  below. 

f  \ 

from  nl tk . t okeni z e  import  wo rd_t okeni z e 
t o ken i z ed_do c s  =  [ word_t okeni z e ( doc )  for  doc  in 

r aw_do  c  s ] 

print  t okeni z ed_do c s 

V _ / 


In  [2] : 
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Out [ 2 ]  : 


In  [3]  : 


Out [ 3 ]  : 


In  [4] : 


[['Here',  'are',  'some',  'very',  'simple',  'basic', 

'sentences',  '.'],  ['They',  'wo',  "n't",  'be',  'very', 

'interesting',  ',',  'I',  M'm",  'afraid',  %'.'],  ['The', 

'point',  'of',  'these',  'examples',  'is',  'to',  '_learn' , 

'how',  %' basic',  'text',  'cleaning',  'works_' ,  'on',  ' *very' , 

'simple*',  'data',  '.']] 

Thus,  for  each  line  of  text  in  raw_docs,  word_tokenize  function  will  set 
the  list  of  word- vectors.  Now  we  can  search  the  list  for  punctuation  symbols,  for 
instance,  and  remove  them.  There  are  many  ways  to  perform  this  step.  Let  us  see 
one  possible  solution  using  the  String  library. 


r 

N 

import  string 

string . punctuation 

_ 

'  !  "#\$\%&\'  ()*+,-./:  ;<  =  >?&  [\\]  A_'  {  | 

See  that  string .  punctuation  contains  a  set  of  common  punctuation  sym¬ 
bols.  This  list  can  be  modified  according  to  the  symbols  you  want  to  remove.  Let  us 
see  with  the  next  example  using  the  Regular  Expressions  (RE)  package  how  punctu¬ 
ation  symbols  can  be  removed.  Note  that  many  other  possibilities  to  remove  symbols 
exist,  such  as  directly  implementing  a  loop  comparing  position  by  position. 

In  the  input  cell  [  6  ] ,  and  without  going  into  the  details  of  RE,  re  .  compile 
contains  a  list  of  “expressions”,  the  symbols  contained  in  string .  punctuation. 

Then,  for  each  item  in  tokenized_docs  that  matches  an  expression/symbol 
contained  in  regex,  the  part  of  the  item  corresponding  to  the  punctuation  will  be  sub¬ 
stituted  by  u  "  (where  u  refers  to  Unicode  encoding).  If  the  item  after  substitution  cor¬ 
responds  to  u  " ,  it  will  be  not  included  in  the  final  list.  If  the  new  item  is  different  from 
u  " ,  it  means  that  the  item  contained  text  other  than  punctuation,  and  thus  it  is  included 
in  the  new  list  without  punctuation  tokenized_docs_no_punctuation.  The 
results  of  applying  this  script  are  shown  in  the  output  cell  [  7  ] . 

import  re 
import  string 

regex  =  re.  compile  (  '  [  %  s  ]  '  %  re  .  escape  (string  . 

punc  tuation )  ) 

t oken i z ed_do c s_no_punc t ua t i on  =  [] 

for  review  in  t okeni z ed_do c s : 
new_review  =  [] 

for  token  in  review: 

new_ token  =  regex . sub (u  '  '  ,  token) 
if  not  new_token  =  =  u'  '  : 

new_r evi ew . append ( new_ token ) 
t oken i z ed_do c s_no_punc t ua t i on . append (new_review 
) 

print  tokeniz  ed_do  c  s_no_punc  tuation 

V _ / 
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Out [ 4 ]  : 


In  [  5  ]  : 


Out [ 5 ]  : 


[['Here',  'are',  'some',  'very',  'simple',  'basic', 

' sentences ' ] , 

['They',  'wo',  u'nt',  'be',  'very',  'interesting',  'I',  u'm' , 

' afraid ' ] , 

['The',  'point',  'of',  'these',  'examples',  'is',  'to', 

u' learn',  'how',  'basic',  'text',  'cleaning',  u' works',  'on', 
u'very' ,  u' simple',  'data']] 

One  can  see  that  punctuation  symbols  are  removed,  and  those  words  containing 
a  punctuation  symbol  are  kept  and  marked  with  an  initial  u.  If  the  reader  wants 
more  details,  we  recommend  to  read  information  about  the  RE  package  for  treating 
expressions. 

Another  important  step  in  many  data  mining  systems  for  text  analysis  consists  of 
stemming  and  lemmatizing.  Morphology  is  the  notion  that  words  have  a  root  form. 
If  you  want  to  get  to  the  basic  term  meaning  of  the  word,  you  can  try  applying 
a  stemmer  or  lemmatizer.  This  step  is  useful  to  reduce  the  dictionary  size  and  the 
posterior  high-dimensional  and  sparse  feature  spaces.  NLTK  provides  different  ways 
of  performing  this  procedure.  In  the  case  of  running  the  porter  .  stem  (word) 
approach,  the  output  is  shown  next. 

/  \ 

from  nl tk . stem . porter  import  Port erS t emmer 
from  nl tk . s t em . snowbal 1  import  Snowbal 1 S t emmer 
from  nl tk . s tern . wordnet  import  WordNe tLemma t i z er 
porter  =  Por terS t emmer ( ) 

#snowball  =  Snowba 1 1 S t emmer  (  '  english  '  ) 

#wordnet  =  WordNe tLemma t i z er ( ) 

#each  of  the  following  commands  perform  stemming  on 
word 

porter  .  stem (word) 

#  snowball  .  stem (word) 

#wordnet . lemmatize (word) 

V _ / 


[['Here',  'are',  'some',  'very',  'simple',  'basic', 
'sentences'],  ['They',  'wo',  u'nt',  'be',  'very', 

'interesting',  'I',  u'm',  'afraid'],  ['The',  'point',  'of', 
'these',  'examples',  'is',  'to',  u' learn',  'how',  'basic', 

'text',  'cleaning',  u'works',  'on',  u'very',  u'simple', 

' data ' ] ] 

[['Here',  'are',  'some',  'veri',  'simpl',  'basic',  'sentenc'], 
['They',  'wo',  u'nt',  'be',  'veri',  'interest',  'I',  u'm', 
'afraid'],  ['The',  ' point ',' of ' ,  'these',  ' exampl ' ,  'is', 

'to',  u'learn',  'how',  'basic',  'text',  'clean',  u'work' ,  'on', 

u ' veri ', u ' simpl ' ,  'data']] 


1  https://docs.python.Org/2/library/re.html. 
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In  [  6  ]  : 


Out [ 6 ]  : 


This  kind  of  approaches  are  very  useful  in  order  to  reduce  the  exponential  number 
of  combinations  of  words  with  the  same  meaning  and  match  similar  texts.  Words 
such  as  “interest”  and  “interesting”  will  be  converted  into  the  same  word  “interest” 
making  the  comparison  of  texts  easier,  as  we  will  see  later. 

Another  very  useful  data  cleaning  procedure  consists  of  removing  HTML  entities 
and  tags.  Those  may  contain  words  and  other  symbols  that  were  not  removed  by 
applying  the  previous  procedures,  but  that  do  not  provide  useful  meaning  for  text 
analysis  and  will  introduce  noise  in  our  posterior  text  representation  procedure.  There 
are  many  possibilities  for  removing  these  tags.  Here  we  show  another  example  using 
the  same  NLTK  package. 


C  \ 

import  nltk 

test_string  ="<p>While  many  of  the  stories  tugged 

at  the  heartstrings  ,  I  never  felt  manipulated  by 
the  authors.  (Note:  Part  of  the  reason  why  I 
don't  like  the  'Chicken  Soup  for  the  Soul' 
series  is  that  I  feel  that  the  authors  are  just 
dying  to  make  the  reader  clutch  for  the  box  of 
tissues . ) </a>" 
print  ' Original  text : ' 
print  test_string 
print  'Cleaned  text:  ' 

nltk .  cl ean_h  tml  ( test_string . decode  ( )  ) 


Original  text : 

<p>While  many  of  the  stories  tugged  at  the  heartstrings,  I 
never  felt  manipulated  by  the  authors.  (Note:  Part  of  the 
reason  why  I  don't  like  the  "Chicken  Soup  for  the  Soul"  series 
is  that  I  feel  that  the  authors  are  just  dying  to  make  the 
reader  clutch  for  the  box  of  tissues. ) </a> 

Cleaned  text: 

u"While  many  of  the  stories  tugged  at  the  heartstrings,  I  never 
felt  manipulated  by  the  authors.  (Note:  Part  of  the  reason  why 
I  don't  like  the  "Chicken  Soup  for  the  Soul"  series  is  that  I 
feel  that  the  authors  are  just  dying  to  make  the  reader  clutch 
for  the  box  of  tissues.)" 

You  can  see  that  tags  such  as  “<p>”  and  “</a>”  have  been  removed.  The  reader 
is  referred  to  the  RE  package  documentation  to  learn  more  about  how  to  use  it  for 
data  cleaning  and  HTLM  parsing  to  remove  tags. 


1 0.3  Text  Representation 

In  the  previous  section  we  have  analyzed  different  techniques  for  data  cleaning,  stem¬ 
ming,  and  lemmatizing,  and  filtering  the  text  to  remove  other  unnecessary  tags  for 
posterior  text  analysis.  In  order  to  analyze  sentiment  from  text,  the  next  step  consists 
of  having  a  representation  of  the  text  that  has  been  cleaned.  Although  different  rep- 
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Text  1:  "I  like  tomatoes  more  than  apples" 

Text  2:  "I  like  reading" 

Text  1:  I  1  I  1  I  1  1  1  I  1  I  1  I 0 _ 

I  like  tomatoes  more  than  apples  reading 

Text  2:  I  1  I  1  I  0 _ 0  0 _ 0  I  1 

I  like  tomatoes  more  than  apples  reading 

Fig.  1 0.1  Example  of  BoW  representation  for  two  texts 

resentations  of  text  exist,  the  most  common  ones  are  variants  of  Bag  of  Words  (BoW) 
models  [1].  The  basic  idea  is  to  think  about  word  frequencies.  If  we  can  define  a 
dictionary  of  possible  different  words,  the  number  of  different  existing  words  will 
define  the  length  of  a  feature  space  to  represent  each  text.  See  the  toy  example  in 
Fig.  10.1.  Two  different  texts  represent  all  the  available  texts  we  have  in  this  case. 
The  total  number  of  different  words  in  this  dictionary  is  seven,  which  will  represent 
the  length  of  the  feature  vector.  Then  we  can  represent  each  of  the  two  available  texts 
in  the  form  of  this  feature  vector  by  indicating  the  number  of  word  frequencies,  as 
shown  in  the  bottom  of  the  figure.  The  last  two  rows  will  represent  the  feature  vector 
codifying  each  text  in  our  dictionary. 

Next,  we  will  see  a  particular  case  of  bag  of  words,  the  Vector  Space  Model  of 
text:  TF-IDF  (term  frequency-inverse  distance  frequency).  First,  we  need  to  count 
the  terms  per  document,  which  is  the  term  frequency  vector.  See  a  code  example 
below. 


Out[7]:  [('me',  2),  ('Mireia',  1),  ('loves',  2),  ('Hector',  1), 

('than',  1),  ('more',  1)]  [('me',  2),  ('Mireia',  1),  ('likes', 

1),  ('loves',  1),  ('Sergio',  1),  ('than',  1),  ('more',  1)] 
[('basketball',  1),  ('football',  1),  ('likes',  1),  ('He',  1), 

('than',  1),  ('more',  1)] 

Here,  we  have  introduced  the  Python  object  called  a  Counter.  Counters  are  only 
in  Python  2.7  and  higher.  They  are  useful  because  they  allow  you  to  perform  this 
exact  kind  of  function:  counting  in  a  loop.  A  Counter  is  a  dictionary  subclass  for 
counting  hashable  objects.  It  is  an  unordered  collection  where  elements  are  stored  as 
dictionary  keys  and  their  counts  are  stored  as  dictionary  values.  Counts  are  allowed 
to  be  any  integer  value  including  zero  or  negative  counts. 
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In  [8] : 


In  [  9  ]  : 


Out [ 9 ]  : 


In  [10] : 


Elements  are  counted  from  an  iterable  or  initialized  from  another  mapping  (or 
Counter). 


z 

\ 

c  =  Counter  (  ) 

# 

a  new,  empty  counter 

c  =  Counter ( ' gallahad ' ) 

# 

a  new  counter  from  an 

i t  er abl e 

v 

_ y 

Counter  objects  have  a  dictionary  interface  except  that  they  return  a  zero  count 
for  missing  items  instead  of  raising  a  KeyError. 


z 

\ 

c  =  Counter (['eggs',  ' ham ' ] ) 

c [ ' bacon ' ] 

v 

_ y 

o 


Let  us  call  this  a  first  stab  at  representing  documents  quantitatively,  just  by  their 
word  counts  (also  thinking  that  we  may  have  previously  filtered  and  cleaned  the  text 
using  previous  approaches).  Here  we  show  an  example  for  computing  the  feature 
vector  based  on  word  frequencies. 


def  bui ld_l exi con ( corpus ) : 

#  define  a  set  with  all  possible  words  included  in 
all  the  sentences  or  "corpus" 
lexicon  =  set  (  ) 
for  doc  in  corpus  : 

lexicon  .  update  ([  word  for  word  in  doc.  split 

(  )  ]  ) 

return  lexicon 
def  tf (term,  document) : 

return  freq ( term  ,  document) 
def  freq ( term ,  document) : 

return  do cument . spl i t ( ) .count (term) 
vocabulary  =  bui ld_l exi c on ( mydo c 1 i s t ) 
doc_term_matrix  =  [] 


pr 
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n  t 
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vo  c  abu 1 a 

ry  ve  c 

tor 
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o 

c 

i 

n 

mydo  c 1 i s 

t  : 

P 

r 

i 

n  t 

' The  doc 

is  "  ' 

+  do 

c 

+ 

/ 

ii 

/ 

t 

f 

ve 

c 

tor  =  [ t  f 

( word , 

doc 

) 

f 

or 

wo 

r 

d 

i 

n 

VO  c 

:abulary ] 

t 

f 

ve 

c 

t  o r_s  t  r i n 

g  =  '  / 

'  •  j 

o  i 

n 

(  f 

o 

rm 

a 

t 

( 

f 

r 

eq  , 

'  d  '  ) 

f 

o 

r 

f 

r 

e 

q 

i 

n 

t 

f 

V 

e 

c 

tor 

) 

P 

r 

i 

n  t 

' The  t  f  v 

e  c  t  o  r 

for 

Do 

c 

um 

e 

n  t 

% 

d 

i 

s  [ 

%  s  ]  ' 

%  (  ( mydo  c 

list  .  i 

ndex 

(d 

o 

c  ) 

+ 

1) 

/ 

t  f _ve  c  tor_strinc 

r) 

d 

o 

c 

_  t 

e 

rm_ma  t r i x 

. appen 

d  (  t  f 

_v 

e 

c  t 

o 

r  ) 

pr 

i 

n  t 

/ 

Al 

1 

combined 

,  here 

i  s 

ou 

r 

m 

a 

s  t 

e 

r 

d 

o 

cum 

en  t 

term  matrix : 


print  do c_ t e rm_ma t r i x 
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Outfit)]:  Our  vocabulary  vector  is  [me,  basketball,  Julie,  baseball, 
likes,  loves,  Jane,  Linda,  He,  than,  more] 

The  doc  is  "Julie  loves  me  more  than  Linda  loves  me" 

The  tf  vector  for  Document  lis  [2,  0,  1,  0,  0,  2,  0,  1,  0,  1, 

1] 

The  doc  is  "Jane  likes  me  more  than  Julie  loves  me" 

The  tf  vector  for  Document  2  is  [2,  0,  1,  0,  1,  1,  1,  0,  0,  1, 

1] 

The  doc  is  "He  likes  basketball  more  than  baseball" 

The  tf  vector  for  Document  3  is  [0,  1,  0,  1,  1,  0,  0,  0,  1,  1, 

1] 

All  combined,  here  is  our  master  document  term  matrix: 

[[2,  0,  1,  0,  0,  2,  0,  1,  0,  1,  1],  [2,  0,  1,  0,  1,  1,  1,  0,  0, 
1,  1],  [0,  1,  0,  1,  1,  0,  0,  0,  1,  1,  1]] 


In  [11] 


Now,  every  document  is  in  the  same  feature  space,  meaning  that  we  can  represent 
the  entire  corpus  in  the  same  dimensional  space.  Once  we  have  the  data  in  the 
same  feature  space,  we  can  start  applying  some  machine  learning  methods:  learning, 
classifying,  clustering,  and  so  on.  But  actually,  we  have  a  few  problems.  Words  are 
not  all  equally  informative.  If  words  appear  too  frequently  in  a  single  document, 
they  are  going  to  muck  up  our  analysis.  We  want  to  perform  some  weighting  of  these 
term  frequency  vectors  into  something  a  bit  more  representative.  That  is,  we  need  to 
do  some  vector  normalizing.  One  possibility  is  to  ensure  that  the  L2  norm  of  each 
vector  is  equal  to  1 . 

/  \ 
import  math 

def  1 2_normal i z er ( vec ) : 

denom  =  np  .  sum  (  [  el  *  *  2  for  el  in  vec]) 
return  [(el  /  math  .  sqrt  (  denom)  )  for  el  in  vec] 
doc_term_matrix_12  =  [] 

for  vec  in  do c_ t e rm_ma t r ix : 

do  c_ t  erm_ma  t  rix_12  .  append ( 12  _no  rma lizer  (vec)  ) 
print  'A  regular  old  document  term  matrix:  ' 
print  np . matrix ( doc_term_matr ix ) 

print  ' \nA  document  term  matrix  with  row-wise  L2 
norm :  ' 

print  np . matrix ( doc_t erm_matr ix_l 2 ) 

V _ / 


Out [11] :  A  regular  old  document  term  matrix: 

[[20100201011] 

[2  010111001  1] 

[01011000111]] 

A  document  term  matrix  with  row-wise  L2  norm: 

[[  0.57735027  0.  0.28867513  0.  0.  0.57735027 
0.  0.28867513  0.  0.28867513  0.28867513] 

[  0.63245553  0.  0.31622777  0.  0.31622777  0.31622777 
0.31622777  0.  0.  0.31622777  0.31622777] 

[  0.  0.40824829  0.  0.40824829  0.40824829  0.  0. 

0.  0.40824829  0.40824829  0.40824829]] 
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In  [12] 


You  can  see  that  we  have  scaled  down  the  vectors  so  that  each  element  is  between 
[0,1].  This  will  avoid  getting  a  diminishing  return  on  the  informative  value  of  a  word 
massively  used  in  a  particular  document.  For  that,  we  need  to  scale  down  words  that 
appear  too  frequently  in  a  document. 

Finally,  we  have  a  final  task  to  perform.  Just  as  not  all  words  are  equally  valuable 
within  a  document,  not  all  words  are  valuable  across  all  documents.  We  can  try 

reweighting  every  word  by  its  inverse  document  frequency. 

/ - \ 

def  numDocsContaining (word ,  doclist) : 
doccount  =  0 

for  doc  in  doclist: 

if  freq  (word  ,  doc)  >  0: 

doc  count  +=  1 

return  doccount 
def  idf (word,  doclist) : 

n_samples  =  len ( doclist ) 

df  =  numDocsContaining ( word ,  doclist) 
return  np  .  log  (  n_samples  /  (  float  (  df  )  )  ) 

my_i d f _ve c t o r  =  [idf (word,  mydoclist)  for  word  in 
vocabulary ] 

print  'Our  vocabulary  vector  is  ['  +  ',  '.join  (list 

(vocabulary) )  +  ' ] ' 

print  'The  inverse  document  frequency  vector  is 
['  +  ',  '.join (format (freq,  ' f ' )  for  freq  in 

my_i d f _ve c t o r )  +  ']  ' 

v _ y 


Out[12]:  Our  vocabulary  vector  is  [me,  basketball,  Mireia,  football, 
likes,  loves,  Sergio,  Hector,  He,  than,  more] 

The  inverse  document  frequency  vector  is  [0.405465,  1.098612, 
0.405465,  1.098612,  0.405465,  0.405465,  1.098612,  1.098612, 
1.098612,  0.000000,  0.000000] 

Now  we  have  a  general  sense  of  information  values  per  term  in  our  vocabulary, 
accounting  for  their  relative  frequency  across  the  entire  corpus.  Note  that  this  is 
an  inverse.  To  get  TF-IDF  weighted  word-vectors,  we  have  to  perform  the  simple 
calculation  of  the  term  frequencies  multiplied  by  the  inverse  frequency  values. 

In  the  next  example  we  convert  our  IDF  vector  into  a  matrix  where  the  diagonal 
is  the  IDF  vector. 


def  bu i 1 d_i d f _ma t r i x ( i d f _ve c t o r ) : 

idf_mat  =  np  .  zeros  (  (  len  (  idf  _vec  tor  )  ,  len  ( 
i d f _ve c  tor)  )  ) 

np . fill_diagonal ( idf _mat ,  idf _vec tor ) 
return  idf_mat 

my_idf _mat r ix  =  bu i 1 d_i d f _ma t r i x ( my_i d f _ve c t o r ) 
print  my_idf _ma tr ix 


In  [13] : 
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Out [13]  :  [  [ 
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In  [14] 


That  means  we  can  now  multiply  every  term  frequency  vector  by  the  inverse 
document  frequency  matrix.  Then,  to  make  sure  we  are  also  accounting  for  words 
that  appear  too  frequently  within  documents,  we  will  normalize  each  document  using 
the  L2  norm. 

/  \ 

do c_t e rm_ma t r i x_t f i d f  =  [] 

#performing  tf-idf  matrix  multiplication 
for  tf_vector  in  do c_ t e rm_ma t r ix : 

doc_t erm_matr ix_t  fidf  . append ( np . dot  ( t  f _vec tor  , 
my_i d f _ma t r ix ) ) 

#  normal i z i ng 

do c_t erm_ma t r ix_t f i d f _1 2  =  [] 

for  tf_vector  in  doc_t erm_mat r ix_t f idf : 
do c_t erm_ma t r ix_t f i d f _1 2 . 
append ( 12_normalizer  ( t  f _vec  tor )  ) 
print  vocabulary 

#  np  .  matrix  (  )  just  to  make  it  easier  to  look  at 
print  np . matrix  ( do  c_ t  erm_ma  trix_tfidf_12  ) 

V _ / 


Out [14]:  set ( [ 'me' ,  'basketball',  'Mireia',  'football',  'likes', 

'loves',  'Sergio',  'Linda',  'He',  'than',  'more']) 

[[  0.49474872  0.  0.24737436  0.  0.  0.49474872  0.  0.67026363  0. 

0.  0.  ] 

[  0.52812101  0.  0.2640605  0.  0.2640605  0.2640605  0.71547492  0. 

0.  0.  0.  ] 

[  0.  0.56467328  0.  0.56467328  0.20840411  0.  0.  0.  0.56467328  0. 
0.  ]] 


1 0.3.1  Bi-Grams  and  n-Grams 

It  is  sometimes  useful  to  take  significant  bi-grams  into  the  model  based  on  the  BoW. 
Note  that  this  example  can  be  extended  to  n -grams.  In  the  fields  of  computational 
linguistics  and  probability,  an  ft -gram  is  a  contiguous  sequence  of  n  items  from 
a  given  sequence  of  text  or  speech.  The  items  can  be  phonemes,  syllables,  letters, 
words,  or  base  pairs  according  to  the  application.  The  ft -grams  are  typically  collected 
from  a  text  or  speech  corpus. 
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A  ft-gram  of  size  1  is  referred  to  as  a  “uni-gram”;  size  2  is  a  “bi-gram”  (or,  less 
commonly,  a  “digram”);  size  3  is  a  “tri-gram”.  Larger  sizes  are  sometimes  referred 
to  by  the  value  of  n ,  e.g.,  “four-gram”,  “five-gram”,  and  so  on.  These  ft-grams  can 
be  introduced  within  the  BoW  model  just  by  considering  each  different  n-gmm  as  a 
new  position  within  the  feature  vector  representation. 


1 0.4  Practical  Cases 

Python  packages  provide  useful  tools  for  analyzing  text.  The  reader  is  referred  to 
the  NLTK  and  Textblob  package  documentation  for  further  details.  Here,  we  will 
perform  all  the  previously  presented  procedures  for  data  cleaning,  stemming,  and 
representation  and  introduce  some  binary  learning  schemes  to  learn  the  text  repre¬ 
sentations  in  the  feature  space.  The  binary  learning  schemes  will  receive  examples 
for  training  positive  and  negative  sentiment  texts  and  we  will  apply  them  later  to 
unseen  examples  from  a  test  set. 

We  will  apply  the  whole  sentiment  analysis  process  in  two  examples.  The  first 
corresponds  to  the  Large  Movie  reviews  dataset  [2].  This  is  one  of  the  largest  public 
available  data  sets  for  sentiment  analysis,  which  includes  more  than  50,000  texts 
from  movie  reviews  including  the  groundtruth  annotation  related  to  positive  and 
negative  movie  reviews.  As  a  proof  on  concept,  for  this  example  we  use  a  subset  of 
the  dataset  consisting  of  about  30%  of  the  data. 

The  code  reuses  part  of  the  previous  examples  for  data  cleaning,  reads  training 
and  testing  data  from  the  folders  as  provided  by  the  authors  of  the  dataset.  Then, 
TF-IDF  is  computed,  which  performs  all  steps  mentioned  previously  for  computing 
feature  space,  normalization,  and  feature  weights.  Note  that  at  the  end  of  the  script  we 
perform  training  and  testing  based  on  two  different  state-of-the-art  machine  learning 
approaches:  Naive  Bayes  and  Support  Vector  Machines.  It  is  beyond  the  scope  of 
this  chapter  to  give  details  of  the  methods  and  parameters.  The  important  point  here 
is  that  the  documents  are  represented  in  feature  spaces  that  can  be  used  by  different 
data  mining  tools. 


2https://textblob.readthedocs.io/en/dev/. 
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In  [ 15 ] : 


f  \ 

from  nl tk . t okeni z e  import  wo rd_t okeni z e 
from  nl tk . stem . porter  import  Port erS t emmer 
from  sklearn . f ea ture_extrac t ion . text  import 
TfidfVectorizer 

from  nl tk .  c 1  as s i f y  import  NaiveBayesClassif ier 
from  sklearn . naive_bayes  import  GaussianNB 
from  sklearn  import  svm 
from  unidecode  import  unidecode 

de f  BoW ( text )  : 

#  Tokenizing  text 

text_t okeni z ed  =  [ word_t okeni z e ( doc )  for  doc  in 

text  ] 

#  Removing  punctuation 

regex  =  re.  compile  (  '  [  %  s  ]  '  %  re . escape ( string . 

punc  tua t i on )  ) 

t oken i z ed_do c s_no_punc t ua t i on  =  [] 

for  review  in  text_t okeni z ed : 
new_review  =  [] 

for  token  in  review: 

new_ token  =  regex . sub  (u'  '  ,  token) 
if  not  new_token  =  =  u'  '  : 

new_r evi ew . append ( new_ token ) 
t oken i z ed_do c s_no_punc t ua t i on . append ( 
new_r evi ew ) 

#  Stemming  and  Lemmatizing 

porter  =  Por terS t emmer ( ) 
pr epr o c e s s ed_do c s  =  [] 

for  doc  in  t o ken i z ed_do c s_no_punc t ua t i on : 
f inal_doc  =  ' ' 

for  word  in  doc  : 

final_doc  =  final_doc  +  '  '  +  porter, 

s  tern ( word ) 

preprocess  ed_do  c  s  . append ( final_doc ) 
return  preprocessed_docs 

#read  your  train  text  data  here 
textTrain= ReadTrainDat aText () 

pr epr o c e s s ed_do c s = BoW ( t ex t Tr a i n )  #  for  train  data 

#  Computing  TIDF  word  space 

t f idf _vec tor i z er  =  TfidfVectorizer (min_df  =  1) 

trainData  =  t f i d f _ve c t o r i z e r . f i t_t r ans f o rm ( 
preprocess  ed_do  c  s ) 

t extTes t = ReadTes tDa taText ( )  #read  your  test  text 
data  here 

pr  epr  o_do  c  s_  t  e  s  t  =  BoW  (  t  ex  t  Te  s  t  )  #  for  test  data 

testData  =  tf idf_vectorizer . transform ( 
prepro_docs_test ) 

v _ / 


10.4  Practical  Cases 


193 


In  [16] : 


( - \ 

print  (  '  Training  and  testing  on  training  Naive  Bayes 

'  ) 

gnb  =  GaussianNB  (  ) 
testData . todense ( ) 

y_pred  =  gnb  .  fit  (  trainData  .  todense  (  )  ,  targetTrain) 

. predict ( trainData . todense ( ) ) 
print ( "Number  of  mislabeled  training  points  out  of 
a  total  %d  points  :  %d" 

%  ( trainData .  shape  [  0 ]  ,  (targetTrain  !=  y_pred) 

.  sum  (  )  )  ) 

y_pred  =  gnb . fit ( trainData . todense ( ) ,  targetTrain) 

. predict ( testData . todense ( ) ) 
print ( "Number  of  mislabeled  test  points  out  of  a 
total  %d  points  :  %d"  % 

( testData . shape [0] , ( targetTest  !=  y_pred ) . sum 

(  )  )  ) 

print  (  ' Training  and  testing  on  train  with  SVM  '  ) 
c 1 f  =  s vm .  SVC  (  ) 

elf  .  f it  ( trainData  .  todense  (  )  ,  targetTrain) 
y_pr ed  =  elf . predict  ( trainData  .  todense  (  )  ) 
print ( "Number  of  mislabeled  test  points  out  of  a 
total  %d  points  :  %d"  % 

( trainData .  shape  [  0 ]  ,  (targetTrain  !=  y_pred)  . 
sum  (  )  )  ) 

print ( ' Testing  on  test  with  already  trained  SVM') 
y_pred  =  elf . predict ( testData . todense ( ) ) 
print ( "Number  of  mislabeled  test  points  out  of  a 
total  %d  points  :  %d"  % 

( testData .  shape  [0]  ,  ( targetTest  !=  y_pred )  .  sum 

(  )  )  ) 

V _ / 


In  addition  to  the  machine  learning  implementations  provided  by  the  Scikit- 
learn  module  used  in  this  example,  NLTK  also  provides  useful  learning  tools  for 
text  learning,  which  also  includes  Naive  Bayes  classifiers.  Another  related  pack¬ 
age  with  similar  functionalities  is  Textblob.  The  results  of  running  the  script  are 
shown  next. 
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Out [16] :  Training  and  testing  on  training  Naive  Bayes 

Number  of  mislabeled  training  points  out  of  a  total  4313  points 
:  129 

Number  of  mislabeled  test  points  out  of  a  total  6292  points  : 

2087 

Training  and  testing  on  train  with  SVM 

Number  of  mislabeled  test  points  out  of  a  total  4313  points  : 

1288 

Testing  on  test  with  already  trained  SVM 

Number  of  mislabeled  test  points  out  of  a  total  6292  points  : 

1680 

We  can  see  that  the  training  error  of  Naive  Bayes  on  the  selected  data  is  129/4313 
while  in  testing  it  is  2087/6292.  Interestingly,  the  training  error  using  SVM  is  higher 
(1288/4313),  but  it  provides  a  better  generalization  of  the  test  set  than  Naive  Bayes 
(1680/6292).  Thus  it  seems  that  Naive  Bayes  produces  more  overfitting  of  the  data 
(selecting  particular  features  for  better  learning  the  training  data  but  producing  such 
high  modifications  of  the  feature  space  for  testing  that  cannot  be  recovered,  just 
reducing  the  generalization  capability  of  the  technique).  However,  note  that  this  is  a 
simple  execution  with  standard  methods  on  a  subset  of  the  dataset  provided.  More 
data,  as  well  as  many  other  aspects,  will  influence  the  performance.  For  instance, 
we  could  enrich  our  dictionary  by  introducing  a  list  of  already  studied  positive  and 
negative  words.  For  further  details  of  the  analysis  of  this  dataset,  the  reader  is 
referred  to  [2]. 

Finally,  let  us  see  another  example  of  sentiment  analysis  based  on  tweets.  Although 
there  is  some  work  using  more  tweet  data  here  we  present  a  reduced  set  of  tweets 
which  are  analyzed  as  in  the  previous  example  of  movie  reviews.  The  main  code 
remains  the  same  except  for  the  definition  of  the  initial  data. 


3 Such  as  those  provided  in  http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html. 
4http://www.sananalytics.com/lab/twitter- sentiment/. 
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In  [18] : 


( - \ 

textTest  =  [  'The  beer  was  good  .  '  ,  'I  do  not  enjoy 

my  job' ,  'I  aint  feeling  dandy  today' ,  'I  feel 
amazing!',  'Gary  is  a  friend  of  mine.',  'I  can 
not  believe  I  am  doing  this  .  '  ] 
targetTest  =  [0,  1,  1,  0,  0,  1] 

preprocess  ed_do  c  s  =  BoW ( textTest ) 
testData  =  tf idf_vectorizer . transform ( 
preprocess  ed_do  c  s ) 

print (' Training  and  testing  on  test  Naive  Bayes') 
gnb  =  GaussianNB  (  ) 
testData .  todense  (  ) 

y_pr ed  =  gnb  .  fit  (  trainData  .  todense  (  )  ,  targetTrain) 

. predict ( trainData . todense ( ) ) 
print ( "Number  of  mislabeled  training  points  out  of 
a  total  %d  points  :  %d"  %  ( trainData  .  shape  [  0 ]  ,  ( 

targetTrain  !=  y_pred )  .  sum  (  )  )  ) 

y_pred  =  gnb . fit ( trainData . todense ( ) ,  targetTrain) 

. predict ( testData . todense ( ) ) 
print ( "Number  of  mislabeled  test  points  out  of  a 
total  %d  points  :  %d"  %  ( testData . shape [0] , ( 

targetTest  !=  y_pred )  .  sum  (  )  )  ) 

print  (' Training  and  testing  on  train  with  SVM  '  ) 
c 1 f  =  s vm .  SVC  (  ) 

elf  .  f it  ( trainData  .  todense  (  )  ,  targetTrain) 
y_pr ed  =  elf . predict  ( trainData  .  todense  (  )  ) 
print ( "Number  of  mislabeled  test  points  out  of  a 
total 

%d  points  :  %d" 

%  (trainData. shape [0], (targetTrain  !=  y_pred 
)  .  sum  (  )  )  ) 

print (' Testing  on  test  with  already  trained  SVM') 
y_pr ed  =  elf . predict ( testData . todense ( ) ) 
print ( "Number  of  mislabeled  test  points  out  of  a 
total 

%d  points  :  %d" 

%  (testData. shape [0], (targetTest  !=  y_pr ed ) . 
sum  (  )  )  ) 

V _ / 


Out [17] :  Training  and  testing  on  test  Naive  Bayes 

Number  of  mislabeled  training  points  out  of  a  total  10  points  :  0 
Number  of  mislabeled  test  points  out  of  a  total  6  points  :  2 
Training  and  testing  on  train  with  SVM 

Number  of  mislabeled  test  points  out  of  a  total  10  points  :  0 
Testing  on  test  with  already  trained  SVM 

Number  of  mislabeled  test  points  out  of  a  total  6  points  :  2 

In  this  scenario  both  learning  strategies  achieve  the  same  recognition  rates  in  both 
training  and  test  sets.  Note  that  similar  words  are  shared  between  tweets.  In  practice, 
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with  real  examples,  tweets  will  include  unstructured  sentences  and  abbreviations, 
making  recognition  harder. 


10.5  Conclusions 

In  this  chapter,  we  have  analyzed  the  problem  of  binary  sentiment  analysis  of  text 
data:  data  cleaning  to  remove  irrelevant  symbols,  punctuation  and  tags;  stemming  in 
order  to  define  the  same  root  for  different  works  with  the  same  meaning  in  terms  of 
sentiment;  defining  a  dictionary  of  words  (including  n -grams);  and  representing  text 
in  terms  of  a  feature  space  with  the  length  of  the  dictionary.  We  have  also  seen  cod¬ 
ification  in  this  feature  space,  based  on  normalized  and  weighted  term  frequencies. 
We  have  defined  feature  vectors  that  can  be  used  by  any  machine  learning  tech¬ 
nique  in  order  to  perform  sentiment  analysis  (binary  classification  in  the  examples 
shown),  and  reviewed  some  useful  Python  packages,  such  as  NLTK  and  Textblob, 
for  sentiment  analysis. 

As  discussed  in  the  introduction  of  this  chapter,  we  have  only  reviewed  the  senti¬ 
ment  analysis  problem  and  described  common  procedures  for  performing  the  analysis 
resulting  from  a  binary  classification  problem.  Several  open  issues  can  be  addressed 
in  further  research,  such  as  the  identification  of  sarcasm,  a  lack  of  text  structure  (as 
in  tweets),  many  possible  sentiment  categories  and  degrees  (not  only  binary  but  also 
multiclass,  regression,  and  multilabel  problems,  among  others),  identification  of  the 
object  of  analysis,  or  subjective  text,  to  name  a  few. 

The  tools  described  in  this  chapter  can  define  a  basis  for  dealing  with  those  more 
challenging  problems.  One  recent  example  of  current  state-of-the-art  research  is  the 
work  of  [3],  where  deep  learning  architectures  are  used  for  sentiment  analysis.  Deep 
learning  strategies  are  currently  a  powerful  tool  in  the  fields  of  pattern  recognition, 
machine  learning,  and  computer  vision,  among  others;  the  main  deep  learning  strate¬ 
gies  are  based  on  neural  network  architectures.  In  the  work  of  [3],  a  deep  learning 
model  builds  up  a  representation  of  whole  sentences  based  on  the  sentence  struc¬ 
ture,  and  it  computes  the  sentiment  based  on  how  words  form  the  meaning  of  longer 
phrases.  In  the  methods  explained  in  this  chapter,  n -grams  are  the  only  features  that 
capture  those  semantics.  For  further  discussion  in  this  field,  the  reader  is  referred 
to  [4,5]. 
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Parallel  Computing 


11.1  Introduction 

The  computer  industry  underwent  a  vigorous  shake-up  several  years  ago.  Major  chip 
manufacturers  gave  up  trying  to  increase  processor  frequency.  Each  year,  more  and 
more  transistors  fit  into  the  same  space,  but  their  clock  speed  cannot  be  increased 
without  overheating.  Thus,  rather  than  trying  to  increase  the  clock  speed,  manufac¬ 
turers  turned  to  multicore  architectures.  A  multicore  processor  is  a  single  computing 
component  with  two  or  more  processing  units  (called  “cores”)  which  read  and  exe¬ 
cute  program  instructions.  Multiple  cores  can  run  different  instructions  at  the  same 
time,  thereby  increasing  the  overall  speed  of  programs  susceptible  to  parallel  com¬ 
puting.  Within  multicore  systems,  the  cores  communicate  through  hardware  (the  bus) 
in  order  to  synchronize  access  to  common  resources  such  as  RAM. 

The  operating  system  is  the  application  that  manages  these  multiple  cores.  If 
two  computation-intensive  processes  (i.e.,  applications)  are  run  on  the  computer,  the 
operating  system  manages  things  so  that  each  task  is  run  on  a  different  core.  If  we 
have  a  single  computation-intensive  task,  it  will  only  run  on  one  core,  even  if  our 
computer  has  multiple  cores.  If  nothing  is  done  explicitly,  we  will  waste  a  lot  of 
computation  power! 

Currently,  in  most  parallel  programming  frameworks,  the  programmer  has  to 
manually  split  the  computation  work  into  multiple  tasks  so  that  each  one  is  executed 
in  different  cores.  The  programmer  has  to  perform  the  split  and  the  operating  system 
will  then  automatically  execute  each  task  on  a  different  core.  So,  each  task  has 
to  be  run  in  different  processes  or  threads.  This  is  the  principle  behind  parallel 
programming;  harnessing  multiple  processors  to  work  on  a  single  task  by  dividing 
it  into  multiple  (smaller)  tasks. 

In  order  to  make  the  most  of  multicore  capabilities,  the  number  of  processes 
should  be  equal  to  the  number  of  processors.  Within  a  parallel  computing  context, 
it  does  not  make  much  sense  to  define  more  tasks  than  cores  we  have,  e.g.,  defining 
eight  computation-intensive  tasks  if  our  computer  only  has  four  cores.  In  this  latter 
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case,  the  operating  system  will  try  to  run  eight  tasks  using  four  cores.  This  is  done  by 
switching  between  the  tasks  in  such  a  way  that  each  one  gets  approximately  the  same 
amount  of  computing  time.  Switching  between  tasks  has  a  computational  cost  and 
thus  overall  performance  may  suffer  if  the  number  of  simultaneous  tasks  is  higher 
than  the  number  of  available  cores. 

Assume  that  a  task  takes  T  seconds  to  run  on  a  single  core  (using  standard  seri¬ 
alized  programming).  Now  assume  that  we  have  a  computer  with  N  cores  and  that 
we  have  divided  our  serialized  application  into  N  subtasks.  By  using  the  parallel 
capabilities  of  our  computer  we  may  be  able  to  reduce  the  total  computation  time  to 
T /N.  This  is  the  ideal  case  and  usually  we  will  not  be  able  to  reduce  the  computation 
time  by  a  factor  of  N.  This  is  due  to  the  fact  that  cores,  on  the  one  hand,  need  to  syn¬ 
chronize  at  the  hardware  level  in  order  to  access  common  resources  such  as  RAM; 
and,  on  the  other  hand,  the  operating  system  needs  some  time  to  switch  between 
all  the  tasks  that  run  on  the  computer.  However,  using  the  multicore  capabilities  of 
the  computer  unit  will  result  in  a  reduction  of  the  computation  time  if  the  tasks  are 
properly  defined. 

Parallelization  can  also  be  performed  by  means  of  distributed  computing.  While 
in  multicore  systems  the  cores  communicate  with  each  other  through  the  bus  at 
the  hardware  level,  in  distributed  systems  software  communicates  and  coordinates 
the  actions  of  computational  entities  located  within  a  network.  The  computational 
entities  are  usually  computers.  In  distributed  computing,  a  large  number  of  discrete 
computers,  named  nodes ,  distributed  across  a  network  (e.g.,  the  Internet)  devote 
some  or  all  of  their  computation  time  to  solving  a  common  problem;  each  node 
receives  and  completes  many  small  tasks,  reporting  the  results  to  a  central  server 
which  integrates  the  results  into  the  overall  solution.  Each  of  the  nodes  has  its  own 
local  memory  and  thus  tasks  that  run  on  different  computers  do  not  need  to  coordinate 
access  to  it.  However,  since  information  is  exchanged  through  the  network,  care  must 
be  taken  in  order  to  select  the  amount  of  information  that  is  passed  so  as  to  optimize 
the  computational  performance. 

In  this  chapter  we  will  focus  on  IPython’s  capabilities  for  parallel  computing,  on 
both  multicore  and  distributed  systems.  IPython  does  indeed  offer  an  environment 
capable  of  dealing  with  both  architectures  in  a  transparent  manner  for  the  program¬ 
mer.  The  user  should  be  aware  of  the  underlying  architecture  in  which  the  application 
will  be  run  in  order  to  avoid  loss  of  performance.  We  would  like  to  point  out  that 
Python  currently  does  not  offer  support  for  the  parallel  capabilities  explained  below. 
IPython,  however,  supports  them. 


11.2  Architecture 

Figure  11.1  shows  a  simplified  version  of  the  IPython  architecture  for  parallel  com¬ 
puting  (multicore  and  distributed).  The  proposed  architecture  enables  IPython  to 


^or  a  more  detailed  description  please  see  http://ipyparallel.readthedocs.io/en/stable/intro.html. 
Last  seen  July  2016. 
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Fig.  11.1  IPython’s 
architecture  for  parallel 
computing  (multicore  and 
distributed) 


support  many  different  styles  of  parallelism  including  those  described  in  this  chapter. 

Each  of  the  blocks  is  explained  below: 

•  Each  engine  is  an  instance  of  IPython,  usually  an  IPython  interpreter,  that  receives 
commands  through  a  connection.  When  multiple  engines  are  started,  multicore 
and  distributed  computing  becomes  possible. 

•  The  scheduler  is  an  application  that  distributes  the  commands  to  the  engines.  We 
will  see  that  there  are  two  ways  of  distributing  this  work:  the  direct  view  and  the 
load-balanced  view,  described  in  later  sections. 

•  The  client  is  an  IPython  object  created  at  an  IPython  interpreter.  This  object  will 
allow  us  to  send  commands  to  the  IPython  engines. 

IPython  uses  the  term  cluster  to  refer  to  the  scheduler  and  the  set  of  engines  that 

make  parallelization  possible.  It  should  not  be  confused  with  the  term  cluster  used 

in  supercomputing.  In  addition,  the  reader  should  take  into  account  that: 

•  Each  engine  is  an  independent  instance  of  an  IPython  interpreter,  i.e.,  it  runs  an 
independent  process.  None  of  the  variables  declared  at,  e.g.,  engine  1  are  visible 
to  the  remaining  engines  or  to  the  client.  In  a  similar  way,  if  we  want  to  work  with 
numpy  functions,  we  should  import  this  toolbox  to  every  engine. 

•  We  may  be  able  to  control  at  which  engine  each  task  is  executed,  but  we  will  not 
be  able  to  control  on  which  core  each  engine  is  executed;  this  is  the  job  of  the 
operating  system. 


1 1 .2.1  Getting  Started 

To  use  IPython’s  parallel  capabilities,  the  first  thing  to  do  is  to  start  the  cluster.  There 
are  two  ways  of  doing  this: 

•  From  the  notebook  interface.  This  is  the  simplest  way  of  proceeding  and  is  the 
recommended  way  for  newbies  in  this  topic.  Within  the  IPython  notebook,  we 
can  use  the  Clusters  tab  of  the  dashboard,  and  press  Start  with  the  desired  number 
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In  [1]  : 


Out  [  1  ]  : 


of  cores,  under  the  desired  profile.  This  will  automatically  run  the  necessary 
commands  to  start  the  IPython  cluster.  In  this  case,  the  notebook  will  be  used  as 
the  interface  with  the  cluster;  i.e.,  we  will  be  able  to  send  different  tasks  to  the 
engines  using  the  web  interface. 

•  From  the  command  line  of  a  terminal.  We  can  run  the  following  command  to  start 
an  IPython  cluster: 

$  ipcluster  start 

This  command  will  create  a  cluster  with  N  engines,  where  N  equals  the  number 
of  cores.  If  we  want  to  create  a  cluster  with  a  different  number  of  engines,  we  just 
run: 

$  ipcluster  start  -n  4 

With  this  command  we  start  a  cluster  with  four  engines.  Once  the  engines  are 
started,  we  may  run  an  IPython  interpreter. 

$  ipython 


1 1 .2.2  Connecting  to  the  Cluster  (The  Engines) 

We  have  seen  how  to  initialize  the  cluster.  No  matter  which  way  we  initialize  the 
cluster,  the  following  commands  allow  us  to  connect  to  it.  These  commands  should 
either  be  introduced  through  the  notebook  or  be  typed  into  the  IPython  command 
line  interpreter  (the  client): 


\ 

from  IPython  import  parallel 

engines  =  par a 1 1 e 1  . C 1 i en t  (  ) 

eng i ne s . b 1 o c k  =  True 

print  engines . ids 

_ y 

[0,  1,  2,  3,  4,  5,  6,  7] 

These  commands  connect  to  the  cluster  and  output  the  number  of  engines  in  it. 
If  an  error  is  shown  when  running  the  commands,  the  cluster  has  not  been  correctly 
created.  We  will  explain  later  on  the  meaning  of  the  block  attribute. 

The  variable  engines  is  an  object  that  represents  the  available  engines  to  which 
commands  can  be  sent.  Let  us  now  see  two  different  ways  we  can  send  tasks  to  the 
engines:  the  first,  called  the  direct  view ,  is  simpler  and  allows  the  user  to  directly 
control  which  tasks  are  sent  to  which  engines;  the  second,  called  the  load-balanced 
view ,  delegates  to  the  IPython  scheduler  the  task  of  deciding  which  engines  each 
task  is  sent  to. 


More  information  on  ipcluster  profiles  can  be  found  at  http://ipython.readthedocs.io/en/stable/. 
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In  [2]  : 


In  [3]  : 


Out  [  3  ]  : 


In  [4]  : 


As  will  be  seen  next,  the  former  view  is  useful  if  a  task  can  be  evenly  distributed 
computationally  into  smaller  tasks;  whereas  the  second  is  more  useful  if  such  sub¬ 
division  cannot  be  easily  done.  For  instance,  if  we  have  to  analyze  multiple  data 
files,  the  direct  view  is  a  good  approach  if  all  the  files  have  approximately  the  same 
size.  But  if  the  files  differ  (quite  a  lot)  in  size,  the  load-balanced  view  is  the  better 
approach.  Let  us  now  see  both  approaches. 


1 1 .3  Multicore  Programming 
1 1 .3.1  Direct  View  of  Engines 

How  do  we  send  a  command  to  the  cluster?  Recall  that  the  engines  variable  just 
defined  represents  the  engines  in  the  cluster.  Within  the  direct  view,  engines  [  0  ] 
represents  the  first  engine,  engines  [  1  ]  the  second  engine,  and  so  on.  The  follow¬ 
ing  commands,  executed  on  the  client  (i.e.,  the  IPython  interpreter),  send  commands 
to  the  first  engine: 

/  \ 

engines  [0].  execute  ('a  =  2  '  ) 
engines  [0]  .  execute  (  'b  =  10') 
engines  [0]  .  execute  (  'c  =  a  +  b') 

V  _ / 

We  may  retrieve  the  result  by  executing  the  following  command  on  the  client: 

f  \ 

engines [0] .pull ( '  c  ' ) 

V  _ / 


12 


Note  that  we  do  not  have  direct  access  to  the  command  line  of  the  first  engine. 
Rather,  we  may  send  commands  to  it  through  the  client. 

What  about  parallelization?  Let  us  try  the  following: 


These  commands  initialize  different  values  for  a  and  b  at  engines  0  and  1  and 
execute  the  sum  at  both  engines.  Since  each  engine  runs  an  independent  process,  the 
operating  system  may  schedule  each  engine  at  different  cores  and  thus  execution  is 
performed  in  parallel.  Again,  as  before,  we  can  retrieve  both  results  using  the  pull 
command: 


204 


11  Parallel  Computing 


In  [  5  ]  : 

Out  [  5  ]  : 

In  [  6  ]  : 

In  [7]  : 


f  \ 

engines  [0:2]  .pull  (  7  c  7  ) 

\ _ / 


[12,  16] 

Note  that  with  these  commands  we  are  directly  accessing  the  engines  and  that  is 
why  this  type  of  approach  is  called  the  direct  view. 

In  order  to  simplify  the  code,  let  us  define  the  following  variables: 


/■ 

\ 

dvi ew2 

=  engines  [0:2] 

dvi  ew 

=  engines . direct_view ( ) 

v 

_ y 

The  variable  dview2  references  the  first  two  engines,  whereas  dview  references 
all  the  current  engines.  This  variable  will  be  used  later  on,  in  Sect.  11.5. 

Let  us  now  try  with  matrix  multiplication.  Assume  we  have  created  four  matrices 
AO,  BO,  A 1,  and  Bl  on  the  client.  The  objective  is  to  compute  the  matrix  products: 
CO  =  A0£0  and  Cl  =  AlBl. 

The  commands  to  be  executed  are  as  follows: 

/  \ 

dview2  .  execute  (  '  import  numpy  as  np  '  ) 

engines  [0]  . push (diet  (A  =  A0  ,  B  =  B0)  ) 
engines  [1]  . push (diet  (A  =  A1  ,  B  =  Bl )  ) 

dview2  .  execute  ('C  =  np  .  dot  (A,  B)  7  ) 

dvi ew2  .pull  (  7  C  7  ) 

V _ / 

Observe  that  the  import  command  has  to  be  run  on  each  of  the  engines  so  that  the 
scientific  computing  library  becomes  available  on  each  engine.  As  before,  the  push 
and  pull  commands  are  used  to  send  and  retrieve  data  between  the  client  and  the 
engines,  and  the  execute  command  computes  the  matrix  product  on  both  engines. 
It  should  be  pointed  out  that  the  push,  execute,  and  pull  commands  block  (i.e., 
they  do  not  return)  until  the  engines  have  completed  their  corresponding  task.  This  is 
due  to  the  attribute  engines  .block  =  True  we  set  when  initializing  the  cluster, 
see  Sect.  11.2.2.  We  may  set  the  attribute  to  False,  in  which  case  the  commands 
will  return  immediately,  without  waiting  for  the  command  to  end.  This  feature  may 
be  very  useful  if  we  want  to  take  full  advantage  of  parallelization  capabilities  and 
performance.  However,  additional  commands  need  to  be  introduced  in  order  to  ensure 
that,  for  instance,  the  execute  command  is  not  issued  before  the  engines  have 
received  the  corresponding  matrices  with  the  push  command.  The  reader  may  find 
more  information  on  this  issue  in  the  corresponding  documentation.  An  example 
of  the  non-blocking  feature  is  shown  in  Sect.  11.5. 

The  previous  examples  show  us  how  to  execute  commands  on  engines  as  if  we 
were  typing  them  directly  into  the  command  line.  Indeed,  we  have  manually  sent, 


3http://ipython.readthedocs.io/en/stable/. 
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In  [8 


In  [  9 


In  [  1 


executed,  and  retrieved  the  results  of  computations.  This  procedure  may  be  useful 
in  some  cases  but  in  many  cases  there  will  be  no  need  for  it.  Indeed,  the  apply 
function  allows  us  to  simplify  such  procedure.  Let  us  see  this  with  the  following 
example: 


\ 

de  f 

mu 1  ( A ,  B  )  : 

import  numpy  as  np 

C  =  np . dot  ( A ,  B  ) 

return  C 

C  = 

engines  [0]  .  apply ( mul  ,  AO  ,  BO  ) 

v 

_ y 

These  commands,  executed  on  the  client,  perform  a  remote  call.  The  function 
mul  is  defined  locally  but  is  executed  on  the  first  engine.  There  is  no  need  to  use 
the  push  and  pull  functions  explicitly  to  send  and  retrieve  the  results;  it  is  done 
implicitly.  All  methods  that  communicate  with  the  engines  are  built  on  top  of  the 
apply  method.  Note  the  import  numpy  as  np  inside  the  function.  This  is  a 
common  model,  to  ensure  that  the  appropriate  toolboxes  are  imported  where  the  task 
is  run. 

If  we  execute  dview2  .  apply  (mul ,  AO ,  BO  )  we  would  execute  the  same 
command  on  engines  0  and  1.  So,  how  can  we  call  up  the  mul  function  and  distribute 
parameters  among  the  engines?  The  direct  view  (and  load-balanced  view,  as  we  will 

see  next)  offers  us  the  map  method  to  tackle  this  issue: 

(  \ 

[  CO  ,  Cl  ]  =  dvi  ew2  .  map  (  mul  ,  [AO  ,  A1],[B0,  Bl]) 

\ _ y 

The  map  call  splits  the  tasks  between  the  engines  associated  with  dview2. 
In  the  previous  example,  the  task  mul (AO , BO )  is  executed  on  one  engine  and 
mul  ( Al ,  Bl )  is  executed  on  the  other  one.  Which  command  is  executed  on  each 
engine?  What  happens  if  the  list  of  arguments  to  map  includes  three  or  more  matrices? 
We  may  see  this  with  the  following  example: 


\ 

engines  [0]  .  execute  (  'my_id  = 

" engine A  "  '  ) 

engines  [1]  .  execute  (  ' my_id  = 

" engine B  "  '  ) 

def  sleep_and_return_id ( sec ) 

: 

import  time 
t ime . s 1 e ep  (sec) 
return  my_id , sec 

dvi ew2  . map  ( s 1 eep_and_r e  turn_ 

id,  [3,  3,  3,  1,  1,  1]) 

V 

y 

Note  that  the  sleep_and_return_id  makes  the  function  sleep  for  the  spec¬ 
ified  amount  of  time  and  returns  the  identifier  of  the  engine  that  has  executed  the 
function.  The  output  is  as  follows: 
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Out [ 10 ] 


In  [11] : 


In  [12] : 


Out [ 12 ] 


[ ( ' engineA' ,  3 ) , 

( ' engineA' ,  3 )  , 

( ' engineA' ,  3 )  , 

( ' engineB ' ,  1 ) , 

( ' engineB ' ,  1 ) , 

( ' engineB ' ,  1 ) ] 

The  previous  output  shows  to  which  engine  each  task  is  assigned.  The  direct 
view  distributes  the  tasks  in  a  uniform  way  among  the  engines  before  execut¬ 
ing  them  no  matter  which  is  the  delay  we  pass  as  argument  to  the  function 
sleep_and_return_id.  Since  the  block  attribute  is  set  to  True,  the  map 
function  blocks  until  all  engines  have  finished  with  their  corresponding  tasks.  This 
is  a  good  way  to  proceed  if  you  expect  each  task  to  take  the  same  amount  of  time. 
But  if  not,  as  is  the  case  in  the  previous  example,  computation  time  is  wasted  and  so 
we  recommend  to  use  the  load-balanced  view  instead. 


1 1 .3.2  Load-Balanced  View  of  Engines 

The  load-balanced  view  is  an  interface  that  allows,  as  does  the  direct  view  interface, 
parallelization  of  tasks.  With  load-balanced  view,  however,  the  user  has  no  direct 
access  to  individual  engines.  It  is  the  IPython  scheduler  that  assigns  work  to  each 
engine.  This  interface  is  simultaneously  simpler  and  more  powerful. 

To  create  a  load-balanced  view  we  may  use  the  following  command: 

f - ^ 

engines . block  =  True 

lview2  =  engines . load_balanced_view ( targets  =  [0,  1]) 

lview  =  engines . load_balanced_view ( ) 

V  _ / 

Again,  we  use  the  blocking  mode  since  it  simplifies  the  code.  As  can  be  seen, 
we  have  defined  two  variables:  lview2  is  a  variable  that  references  the  first  two 
engines,  whereas  lview  references  all  the  engines. 

Our  example  will  be  centered  on  the  sleep_and_return_id  function  we 
saw  in  the  previous  subsection: 

/ - \ 

1 vi ew2 . map ( s 1 eep_and_r e turn_i d ,  [3  ,3  ,3  ,1  ,1  ,  1]) 

V  _ / 

Observe  that  rather  than  using  the  direct  view  interface  (dview2  variable)  of 

the  map  function,  we  use  the  associated  load-balanced  view  interface  (lview2 

variable).  The  output  for  our  execution  is  as  follows: 

[ ( ' engineB ' ,  3 ) , 

( ' engineA' ,  3 ) , 

( ' engineB ' ,  3 ) , 

( ' engineA' ,  1 ) , 

( ' engineA' ,  1 )  , 

( ' engineA ' ,  1 ) ] 
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As  for  the  case  of  the  direct  view,  the  map  function  returns  as  soon  as  all  the  tasks 
have  finished,  since  we  are  using  the  blocking  mode.  The  output  may  vary  each  time 
the  map  function  is  executed.  In  this  case,  the  tasks  are  assigned  to  the  engines  in 
a  dynamic  way.  The  map  function  of  the  load-balanced  view  begins  by  assigning 
one  task  to  each  engine  in  the  order  given  by  the  parameters  of  the  map  function. 
By  default,  the  load-balanced  view  scheduler  then  assigns  a  new  task  to  an  engine 
when  it  becomes  free.  Since  with  the  load-balanced  view  we  do  not  know  on  which 
engine  execution  will  take  place,  explicit  data  movement  methods  like  push  and 
pull  functions  are  not  provided  in  this  view.  The  direct  view  should  be  used  instead 
if  needed. 

The  reader  should  have  noticed  the  simplicity  of  the  IPython  interface  to  parallelize 
tasks.  Once  the  cluster  of  engines  has  been  set  up,  we  may  use  the  map  function  to 
execute  tasks  in  parallel.  This  simplicity  allows  IPython’ s  parallelization  capabilities 
to  be  used  in  distributed  computing.  We  next  offer  an  overview  of  some  of  the 
associated  issues. 


1 1 .4  Distributed  Computing 

The  previous  section  introduced  multicore  computing;  i.e.,  how  to  take  advantage 
of  the  N  multiple  cores  of  a  computer  in  order  to  speed  up  code  execution.  An 
application  that  takes  T  seconds  to  execute  on  a  single  core  could  be  executed  in 
T /N  seconds  if  the  tasks  are  properly  defined.  But  what  if  we  need  to  reduce  the 
computation  time  even  more? 

One  solution  might  be  what  is  called  as  scale-up.  That  is,  buying  a  new  computer 
or  a  new  processor  with  more  cores,  adding  more  memory  to  the  system,  buying 
faster  storage,  and  so  on. 

Another  solution  is  called  scale-out:  interconnecting  multiple  computers  to  make 
them  work  together  to  solve  a  problem.  That  is,  create  a  grid  of  computers.  Grids 
allow  you  to  scale  your  system  to  meet  your  needs:  add  as  many  computers  as  you 
need,  use  all  of  them  or  only  a  few  of  them.  Grids  offer  great  scalability  but  low 
performance;  whereas  supercomputers  give  the  best  performance  values  but  have 
scalability  limitations. 

In  distributed  computing,  the  nodes  work  together  in  order  to  solve  a  problem. 
As  information  is  exchanged  through  the  network,  care  must  be  taken  to  select  the 
amount  of  information  that  is  passed  in  order  to  optimize  computational  performance. 
One  of  the  most  prominent  examples  of  distributed  computing  is  the  SETI@Home 
project:  a  project  that  searches  for  extraterrestrial  life  by  analyzing  radiotelescope 
signals.  For  that,  the  computational  capacity  of  millions  of  computers  belonging  to 
volunteer  users  is  used. 


4Changing  this  behavior  is  beyond  the  scope  of  this  chapter.  You  can  find  more  details  here:  http:// 
ipyparallel.readthedocs.io/en/stable/task.html#schedulers.  Last  seen  November  2015. 
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IPython  offers  the  possibility  of  setting  up  a  cluster  of  engines  running  on  dif¬ 
ferent  computers.  One  way  to  proceed  is  to  use  the  ipc luster  command  (see 
Sect.  11.2.1)  in  SSH  mode;  the  official  documentation  has  examples  of  this.  Config¬ 
uring  IPython  to  work  with  a  grid  of  computers  is  not  as  easy  as  configuring  it  for 
multicore  computing,  so  commercial  platforms  that  offer  the  computational  grid  and 
ease  the  configuration  process  are  also  available. 

All  the  commands  that  are  discussed  in  Sect.  1 1.3  can  also  be  used  in  distributed 
programming.  However,  it  should  be  taken  into  account  that  the  push  and  pull 
commands  send  data  through  the  network.  Sending  many  data  through  the  network 
may  drastically  reduce  the  performance  of  the  system;  thus  data  movement  is  an 
important  issue  to  tackle  in  distributed  computing.  Rather  than  using  push  and 
pull  commands  (either  explicit  or  implicitly),  engines  may  access  the  data  they 
need  directly  on  disk.  Different  approaches  may  be  used  in  this  case;  data  may  be 
stored  in  a  shared  filesystem,  for  instance.  This  approach  is  useful  and  common  if 
computers  are  interconnected  within  a  local  network  but  it  is  difficult  to  implement 
with  computers  connected  in  different  networks.  In  a  shared  filesystem,  the  data  are 
stored  in  a  server  and  thus  each  computer  has  to  connect  with  the  server  and  retrieve 
the  data  needed  from  the  same  server.  This  can  become  a  bottleneck  when  working 
with  many  data. 

Another  approach  is  to  use  a  distributed  filesystem.  In  this  case,  rather  than  storing 
all  the  data  in  a  single  server,  data  are  divided  into  chunks  and  replicated  between 
multiple  computers.  The  data  to  be  processed  are  distributed  and  thus  the  same 
computer  that  stores  the  chunk  can  work  with  it.  This  way  of  proceeding  may  be 
useful  for  Big  Data:  a  broad  term  that  refers  to  the  processing  of  large  datasets. 


1 1 .5  A  Real  Application:  New  York  Taxi  Trips 

This  section  presents  a  real  application  of  the  parallel  capabilities  of  IPython  and 
discussion  of  several  approaches  to  it.  The  dataset  is  a  database  of  taxi  trips  in 
New  York  and  it  has  been  obtained  through  a  Freedom  of  Information  Law  (FOIL) 
request  from  the  New  York  City  Taxi  &  Limousine  Commission  (NYCT&L)  by  the 
University  of  Illinois  at  Urbana-Champaign.  The  dataset  consists  of  12  x  2  Gbyte 
CSV  files.  Each  file  has  approximately  14  million  entries  (lines)  and  is  already 
cleaned.  Thus  no  special  preprocessing  is  needed  to  be  able  to  process  it.  For  our 
purposes,  we  are  only  interested  in  the  following  information  from  each  entry: 

•  pickup_datetime:  start  time  of  the  trip,  mm-dd-yyyy  hh24:mm:ss  EDT. 

•  pickup_longitude  and  pickup_latitude:  GPS  coordinates  at  the  start 
of  the  trip. 


5  http :  //publi  sh  .illinoi  s .  edu/db  work/  open-  data  / . 
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Our  objective  is  to  analyze  these  data  in  order  to  answer  the  following  questions: 
for  each  district,  how  many  pickups  are  performed  during  week  days  and  how  many 
during  weekends?  And  how  many  pickups  are  performed  in  the  morning?  For  this 
issue,  the  city  of  New  York  is  arbitrarily  divided  into  nine  districts:  ChinaTown,  WTC, 
Soho,  Harlem,  UpperTown,  MidTown,  DownTown,  UpperEastSide,  UpperWestSide, 
and  Financial. 

Implementing  the  previous  classification  is  rather  simple  since  it  only  requires 
checking,  for  each  entry,  the  GPS  coordinates  of  the  start  of  the  trip  and  the  pickup 
date  and  time.  Performing  this  task  in  a  sequential  way  may  take  a  rather  long  time, 
since  the  number  of  entries,  for  a  single  CSV  file,  is  rather  large.  In  addition,  special 
care  has  to  be  taken  when  reading  the  file  since  a  2  Gbyte  file  may  not  fit  into  the 
computer’s  memory. 

We  may  take  advantage  of  parallelization  capabilities  in  order  to  reduce  the  pro¬ 
cessing  time.  The  idea  is  to  divide  the  input  data  into  chunks  so  that  each  engine  takes 
care  of  classifying  the  entries  in  their  corresponding  chunks.  A  simple  procedure  may 
follow  from  the  previous  idea:  we  may  explicitly  divide  the  original  2  Gbyte  file  into 
multiple  smaller  files  of  approximately  the  same  number  of  entries.  Such  splitting 
may  be  performed  using,  for  instance,  the  Unix  split  command.  Once  performed, 
each  engine  reads  and  processes  its  chunks  and  the  result  may  be  collected  by  the 
client.  Since  we  expect  each  chunk  to  be  processed  in  the  same  amount  of  time  the 
chunks  may  be  distributed  by  the  client  using  the  map  function  of  the  direct  view. 

Although  straightforward  to  implement,  this  has  several  drawbacks.  Note  that 
the  new  procedure  includes  a  splitting  stage  that  divides  the  input  file  into  multiple 
smaller  files.  Splitting  the  file  implies  accessing  a  disk  for  reading  and  writing, 
and  thus  it  may  reduce  the  overall  possible  improvement,  since  accessing  the  disk  is 
usually  slow  in  comparison  to  CPUs  computing  capabilities.  In  addition,  the  splitting 
process  reads  the  input  file  and  afterwards  each  engine  reads  the  split  data  again  from 
the  disk.  There  is  no  need  to  read  data  twice.  We  may  avoid  reading  the  data  twice  by 
letting  each  engine  read  their  corresponding  chunks  from  the  original  non- split  file. 
However,  this  may  also  reduce  the  overall  improvement  since  it  may  imply  numerous 
movements  of  the  disk  brace  when  data  are  read  from  the  disk  by  multiple  engines. 
Finally,  care  should  be  taken  when  splitting  the  input  file  into  smaller  ones.  Notice 
that  each  engine  will  read  its  assigned  chunk  and  thus  we  must  ensure  that  all  chunks 
read  by  the  engines  fit  into  memory. 


1 1 .5.1  A  Direct  View  Non-Blocking  Proposal 

We  propose  here  a  second  approach  which  avoids  reading  the  data  twice  by  the 
computer.  It  is  based  on  implementing  a  producer-consumer  paradigm  in  order  to 
distribute  the  tasks.  The  producer,  associated  with  the  client,  reads  the  chunks  from 
disk  and  distributes  them  among  the  engines  using  a  round-robin  technique.  No 
explicit  map  function  is  used  in  this  case.  Rather,  we  simulate  the  behavior  of  the 
map  function  in  order  to  have  fine  control  of  the  parallel  problem.  Recall  that  each 
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engine  runs  an  independent  process.  Since  we  assign  different  tasks  to  each  engine, 
the  operating  system  will  try  to  execute  each  engine  via  a  different  process. 

Assume  engines  are  labeled  with  values  1  to  N.  The  proposed  solution,  based  on 
a  round-robin  algorithm,  is  as  follows:  the  client  begins  by  manually  distributing 
a  chunk  to  each  engine  in  an  ordered  way,  from  engine  1  to  engine  N,  and  asking 
them  to  analyze  its  contents.  This  is  performed  in  a  non-blocking  mode:  the  client 
will  not  wait  for  the  task  to  finish  on  one  engine  in  order  to  send  a  chunk  to  the  next 
engine.  Once  a  chunk  has  been  distributed  to  each  engine,  the  client  then  waits  for 
the  engine  1  to  finish.  Once  finished,  it  sends  a  new  chunk  to  it  and  asks  it  to  analyze 
it  without  waiting  for  the  engine  to  finish.  The  client  then  waits  for  the  engine  2 
to  finish,  sends  it  a  new  chunk  and  asks  it  to  process  it,  and  so  on.  The  previous 
procedure  is  repeated  until  all  the  chunks  have  been  sent  to  the  engines.  The  engines 
accumulate  the  overall  partial  result  of  analyzing  their  chunks  in  a  local  variable. 
Once  all  the  engines  have  finished,  the  client  collects  the  partial  results  of  each  engine 
to  compute  the  final  result. 

This  round-robin  technique  is  useful  since  each  engine  receives  a  chunk  of  the 
same  size.  Thus,  each  engine  is  expected  to  take  the  same  amount  of  time  to  process 
its  chunk.  Indeed,  if  all  engines  are  processing  a  chunk,  the  most  likely  engine  to 
finish  first  is  the  one  that,  among  all  engines,  is  next  in  the  round-robin  queue. 

Our  solution  is  based  on  the  direct  view  interface,  see  Sect.  11.3.1.  We  use  the 
direct  view  since  we  would  like  to  have  explicit  access  to  the  engines  in  order  to 
distribute  the  chunks.  We  also  assume  that  one  CSV  file  does  not  fit  into  memory. 
Therefore,  the  client  (i.e.,  the  producer)  will  split  the  input  data  into  uniform  chunks 
of  appropriate  size.  The  whole  implementation  of  the  solution  is  available  as  an 
IPython  notebook.  Here,  we  discuss  only  issues  related  to  parallelization.  Therefore, 
no  number  has  been  assigned  to  the  input  cells. 

First,  let  dvi  ew  be  an  IPython  object  associated  with  all  the  engines  in  the  cluster. 
We  set  the  block  attribute  to  True,  i.e.,  by  default  all  the  commands  that  are  sent  to 
the  engines  will  not  return  until  they  are  finished.  In  order  to  be  able  to  send  tasks  to 
the  engines  in  a  round-robin-like  fashion,  an  infinite  iterator  over  the  list  of  engines 
can  be  created.  This  can  be  done  with  a  Cycle  object: 

r - \ 

from  itertools  import  cycle 
c_engines  =  eye 1 e ( engines . ids ) 

v _ / 

Our  proposal  then  has  the  following  steps,  see  Fig.  11.2: 

1 .  We  begin  by  sending  each  engine  all  the  necessary  functions  that  are  needed  to 
process  the  data.  Of  these  functions,  we  just  mention  init  ( ) ,  which  resets  the 
(local)  engine’s  variables,  and  process  (b),  which  classifies  a  chunk  b  of  lines 
and  groups  the  results  into  a  local_total  variable,  which  is  local  to  each 
engine.  After  sending  the  necessary  functions  to  the  engines,  in  each  engine  we 
execute  the  init  ( )  function,  in  order  to  initialize  the  local  variables  in  each 
engine: 
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Fig.  11. 2  Block  diagram  of  the  algorithm  to  process  databases  with  taxi  trips 


/ - \ 

for  i  in  engines. ids: 

async_tasks [i]  =  engines [i] . execute ( ' init () ' , 

block  =  False) 

\ _ / 


Observe  that  it  is  executed  in  non-blocking  mode.  That  is,  the  init  ( )  function 
is  executed  on  each  engine  without  waiting  for  the  engine  to  finish  and  thus  the 
execute  command  will  return  immediately.  Thus,  the  loop  can  be  executed 
for  each  engine  in  parallel.  In  order  to  know  whether  the  execute  command  has 
finished  for  a  given  engine,  we  will  need  to  check,  when  needed,  the  state  of  the 
corresponding  async_tasks  variable. 

After  performing  this  step  the  client  enters  a  loop  made  up  of  steps  2  to  6  (see 
Fig.  11.2). 

2.  The  client  reads  a  chunk  of  the  file  and  selects  which  engine  the  chunk  will  be 
sent  to: 


r 

new_chunk  = 

get_chunk(f,  1 ines_per_block ) 

\ 

run_eng i ne 

=  c_engines . next ( ) 

v 

_ y 

These  commands  will  be  executed  even  if  the  init  ( )  function  has  not  finished 
or  if  the  engines  have  not  finished  processing  their  previous  chunk.  Each  read 
chunk  will  have  the  same  number  of  lines  (with  the  exception  of  the  last  chunk 
read  from  the  file)  and  thus  we  expect  each  chunk  to  be  processed  in  the  same 
amount  of  time  by  each  engine.  We  therefore  manually  select  the  next  engine  in 
a  round-robin  fashion. 

3.  Once  the  chunk  has  been  read  and  the  engine  that  will  process  the  chunk  has  been 
selected,  we  need  to  wait  for  the  engine  to  finish  its  previous  task.  It  may  still 
be  in  the  initialization  state  or  it  may  be  processing  a  previous  chunk.  While  the 
engine  has  not  finished,  we  wait: 

C  \ 

whi le  (  not  async_tasks[ run_engine ]  .  ready  (  )  )  : 

time .  s 1 eep  (  1 ) 

V _ / 


4.  At  this  point,  we  are  sure  that  the  run_engine  engine  is  free.  Thus,  we  may 
send  the  data  to  the  engine  and  ask  it  to  process  them: 
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f  \ 

mydict  =  diet (data  =  new_chunk) 

engines [run_engine ] . push ( mydict ,  block  =  True) 
async_tasks [run_engine]  =  engines [run_engine] . 

execute ('process (data)',  block  =  False) 

V  _ / 

The  push  is  performed  with  the  default  value  of  block  =  True.  Thus  the 
push  function  will  not  return  until  the  chunk  has  arrived  at  the  engine.  Once 
it  returns,  we  are  sure  that  the  chunk  has  been  received  by  the  engine  and  thus 
we  may  call  the  execute  function.  The  latter  function  will  process  the  data  in 
non-blocking  mode.  Thus,  the  execute  function  will  return  immediately  and 
meanwhile  the  engine  will  process  its  corresponding  block. 

It  should  be  mentioned  that  the  process  function  locally  aggregates  the  results 
of  analyzing  each  chunk  in  the  variable  local_total.  At  the  end,  the  client 
will  collect  the  local  results  from  all  the  engines. 

5.  The  algorithm  then  jumps  again  to  step  2.  The  first  time  step  2  is  executed  the 
selected  engine  is  engine  0.  The  second  time  it  will  be  engine  1  and  so  on.  After 
a  chunk  has  been  assigned  to  all  engines  the  algorithm  will  again  select  engine  0; 
so  it  will  wait  until  engine  0  has  finished  processing  its  previous  chunk. 

6.  Once  the  loop  (steps  2  to  5)  has  processed  all  the  chunks  in  the  file,  the  client  gets 
the  results  from  each  engine  and  aggregates  them  into  the  global_result 
variable.  Before  reading  the  result  we  need  to  be  sure  that  the  engine  has  finished 
with  its  last  chunk: 

/  \ 

for  engine  in  engines  .  ids  : 

whi le  (not  async_tasks  [engine]  .  ready  (  )  )  : 
t ime .  s 1 eep  (  1 ) 

global_result  +=  engines  [engine]  . pull  (  '  local_total  '  , 

block  =  True ) 

V  _ / 

The  pull  is  performed  in  blocking  mode.  After  reading  all  the  results  from  the 
engines  the  final  result  is  stored  in  the  dictionary  global_result. 


11.5.2  Results 

The  experiments  were  performed  on  an  i7-4790  CPU  with  four  physical  cores 
with  HyperThreading  and  8  Gb  of  RAM.  We  performed  experiments  with  differ¬ 
ent  numbers  of  engines  and  different  numbers  of  lines  per  block  (i.e.,  the  vari¬ 
able  lines_per_block  in  the  previous  subsection).  The  performance  results  are 
shown  in  seconds  and  were  obtained  by  computing  the  mean  of  three  executions. 


1 1 .5.2.1  Lines  per  Block 

The  number  of  lines  per  block  defines  the  number  of  data  that  will  be  sent  to  each 
of  the  engines  to  be  processed.  In  order  to  test  the  performance  of  the  algorithm,  we 
performed  tests  with  different  values  of  lines  per  block  and  a  reduced  version  of  one 
CSV  file:  only  1  million  lines  were  processed.  The  experiments  used  8  engines;  i.e., 
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Fig.  11. 3  Performance  to 
process  1  million  lines  of  a 
CSV  file  using  8  engines  for 
different  values  of  lines  per 
block.  Time  is  shown  in 
seconds 


lines  per  block 


the  number  of  processors  of  the  computer.  Thus,  in  our  environment,  there  will  be  a 
total  of  nine  processes  running:  one  producer,  which  is  in  charge  of  reading  the  CSV 
file  and  distributing  the  data  among  the  engines  in  blocks  defined  by  the  variable 
associated  with  lines  per  block,  and  eight  engines  that  will  take  the  blocks  of  data 
from  the  producer  and  process  them. 

The  results  are  shown  in  Fig.  11.3.  As  can  be  seen,  an  optimal  execution  time 
is  located  near  2,000  lines  per  block.  With  fewer  lines  per  block,  efficiency  is  lost 
because  most  of  the  time  engines  are  idle  (thus  cores  are  also  idle),  and  the  system 
wastes  lots  of  computational  time  managing  short  messages  between  processes. 
When  working  with  more  than  6,000  lines  per  block,  the  messages  to  be  passed 
between  processes  are  too  big  to  be  moved  quickly. 

Similar  effects  can  be  found  by  modifying  the  waiting  time  when  an  engine  is 
busy;  see  step  3  in  Sect.  1 1.5.1.  Tests  can  be  done  to  show  that  with  a  shorter  waiting 
time  the  optimal  number  of  lines  per  block  value  is  reduced.  Nevertheless,  optimal 
execution  time  does  not  change  because  the  optimal  execution  time  is  based  on  not 
having  idle  cores. 


1 1 .5.2.2  Number  of  Engines 

The  number  of  engines  is  associated  with  the  level  of  parallelization  that  the  code  can 
reach.  We  tested  our  algorithm  using  2,000  lines  per  block  and  different  numbers 
of  engines,  again  using  a  reduced  version  of  one  CSV  file.  In  this  case,  100,000 
lines  were  processed.  The  result  is  shown  in  Fig.  11.4.  As  can  be  seen,  for  a  given 
number  of  cores,  the  time  that  is  needed  to  process  the  data  reduces  as  the  number 
of  engines  is  increased,  and  the  relation  between  the  number  of  engines  and  time  is 
not  linear.  The  reason  for  this  is  that  the  operating  system  sees  each  engine  as  one 
process  and  thus  each  engine  is  expected  to  be  scheduled  on  different  processors 
of  the  computer.  Note  that  for  one  engine  the  execution  time  is  rather  high;  time  is 
reduced  if  more  engines  are  included  in  the  environment  until  the  number  of  engines 
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Fig.  11. 4  Performance  to 
process  100,000  lines  for 
different  numbers  of  engines 
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number  of  engines 


is  close  to  the  number  of  cores  of  the  computer.  Once  the  minimum  is  reached  (in 
this  case  for  eight  cores)  there  is  no  benefit  from  parallelizing  the  job  with  more 
engines;  on  the  contrary,  with  more  processes,  the  operating  system  scheduler  is 
going  to  spend  more  time  managing  processes  so  the  execution  time  may  increase. 
That  is,  the  operating  system  scheduler  may  become  a  bottleneck.  In  addition,  recall 
that  the  producer  process  in  charge  of  distributing  the  data  among  the  engines  steals 
processing  time  from  the  engines. 


1 1 .5.2.3  Processing  the  Entire  Dataset 

With  this  optimal  value  of  2,000  for  the  lines  per  block  variable  we  executed  our 
algorithm  over  a  whole  CSV  file  made  up  of  14.7  million  lines.  The  execution  time 
with  eight  engines  was  1009  seconds;  and  with  four  engines,  that  time  increased  to 
1895  seconds. 

As  can  be  seen,  increasing  the  number  of  engines  by  a  factor  of  two  does  not 
divide  the  execution  time  by  two.  The  reason  of  this  can  be  explained  by  the  fact 
that  there  is  an  additional  process,  the  producer,  that  distributes  the  blocks  of  lines 
between  the  engines. 


11.6  Conclusions 

This  chapter  has  focused  on  the  parallel  capabilities  of  IPython.  As  has  been  seen, 
IPython  offers  us  an  architecture  that  is  capable  of  supporting  many  styles  of  par¬ 
allelism,  including  multicore  and  distributed  computing.  In  order  to  take  advantage 
of  such  architecture,  the  user  has  to  manually  split  the  task  to  be  performed  into 
multiple  subtasks.  Each  of  these  subtasks  may  then  be  executed  on  different  engines. 


References 


215 


The  direct  view  offers  the  user  the  possibility  of  controlling  which  engine  each  task 
is  sent  to;  whereas  the  load-balanced  view  leaves  this  issue  to  the  scheduler.  The 
former  is  useful  if  the  tasks  to  be  executed  have  similar  computational  cost  or  if  a 
fine  control  over  the  tasks  executed  by  each  engine  is  needed.  The  latter  is  useful 
if  the  tasks  have  different  computational  costs  and  it  does  not  matter  which  engine 
each  task  is  executed  on. 

We  used  the  IPython  parallel  capabilities  to  analyze  a  database  made  up  of  millions 
of  entries.  The  tasks  were  created  by  dividing  the  database  into  chunks  and  assigning, 
in  a  cyclic  manner,  each  of  the  chunks  to  an  engine. 

The  framework  explained  in  this  chapter  is  not  the  only  one  currently  available  for 
IPython  to  take  advantage  of  parallel  computing  capabilities.  For  instance,  Hadoop 
and  Apache  Spark  are  cluster  computing  frameworks  whose  Application  Program¬ 
ming  Interface  is  available  for  the  IPython  notebook.  Thus,  these  frameworks  can  be 
effectively  used  for  data  analysis. 
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